
Tesi.ElISAPentaformatoTesi - r1.37 - 05 Mar 2008 - 15:21 - SilvioPeroni


Automatic document conversion: a model and an implementation

Author: Silvio Peroni

Supervisor: Fabio Vitali

Keywords: PML, Pentaformat, Document segmentation, Pattern, XML

03/03/2008

In Xanadu did Kubla Khan

A stately pleasure-dome decree:

Where Alph, the sacred river, ran

Through caverns measureless to man

Down to a sunless sea.

from Kubla Khan or, A Vision in a Dream. A Fragment. by Samuel Taylor Coleridge

Abstract

This thesis proposes a rule-based mechanism for segmenting XML documents according to a model called Pentaformat. This model makes it possible to identify every element of any textual document as belonging to one or more of the five dimensions defined by the model itself: content, structure, presentation, dynamic behaviour and metadata. The ultimate goal of this work is to enable the conversion of a segmented document into a new document using all or a subset of these dimensions.

After a brief overview of the technological context in which our work is placed, we present its first result: the Pentaformat Markup Language, or PML, an XML language that associates one or more Pentaformat dimensions with each node of an XML document through a dedicated declaration. Building on this language, we have developed elISA 2.0 (Extraction of Layout Information via Structural Analysis), an engine that segments any XML document on the basis of specific structural rules. These rules are contained in a further XML document and can easily be modified and extended in compliance with a grammar we have defined.

The segmented XML document returned by elISA 2.0 is the starting point for conversion into other formats based on the Pentaformat dimensions identified in the initial document. In particular, in this work we focused on the conversion from PML to IML (ISAWiki Markup Language). IML is the storage format for documents used in ISAWiki, a client/server platform implementing the concept of global editability for the whole Web, so that every registered user can freely modify the content of any page. The idea is to embed elISA 2.0 in this platform so as to identify the content of any web document. All changes made to a page, and all newly created pages, are stored by ISAWiki as IML documents. Our goal is to convert the PML document returned by elISA 2.0 into an IML document.

IML is a format based on two of the five Pentaformat dimensions: content and structure. At first sight the conversion from PML to IML may seem simple, given that it moves from a five-dimensional model to one with a smaller dimensional subset. But the number of dimensions is not the only difference between PML and IML. The latter, in fact, imposes a strict structuring of content based on seven specific structural patterns, whereas by definition the Pentaformat (and consequently PML) imposes no hierarchy on its dimensions: the author is free to structure the document as they see fit.

The problem is therefore the following: to convert a PML document into IML, the former must somehow be patterned according to the seven structural patterns of the latter. To carry out this conversion, we developed a further engine that patterns any XML document on the basis of specific rules. As with elISA 2.0, the rules for this new engine, contained in an XML document, can easily be modified and extended in compliance with a grammar we have defined. Through this patterning operation we obtain a PML document fully patterned according to the structural pattern model used in IML.

The final result presented at the end of this work is a web application, called elISA Server Side, that implements the whole conversion process through the consecutive use of elISA 2.0 and the patterning engine. Starting from an ordinary web document, elISA 2.0 identifies its dimensions and returns a PML document. A structural patterning is then applied to it, so as to obtain a new PML document conforming to the seven structural patterns used in IML. Finally, this new patterned PML document is converted into an IML document composed of all the content identified by elISA 2.0 in the initial document.

Our work brings essentially two benefits. The first concerns document segmentation. Thanks to the five-dimensional identification of the roles of the various elements of an XML document, new documents can be produced starting from only some of the identified dimensions. The conversion from PML to IML is an example of this operation: through the conversion process realized in our work, a new document is obtained by taking into account only the content and structure dimensions of the original. Clearly, the same reasoning applies to other dimensions as well: for example, we might decide to use only the structure and presentation of a document to create a kind of template for the creation of new documents, and so on.

The other benefit of our work concerns structural patterns. On the elements of a document known to be patterned, simple deductions can be made to determine, by analyzing their structure, which pattern they belong to and, consequently, what they may or may not contain. It then becomes easier, for example, to identify which elements may hold textual content - only some patterns, in fact, admit text - and which elements are used only for the logical structuring of the document. The patterning engine we have realized makes it safe to assume a patterned document on which to apply these automatic deductions.

1 Once upon a time (or Introduction)

In this thesis we propose a rule-based mechanism to segment XML documents according to a five-dimensional model called Pentaformat [Dii07]*, in order to convert them automatically into new documents using one or more of the constituents introduced by the model: content, structure, presentation, behaviour and metadata.

The Pentaformat is a model suggested by Di Iorio in his Ph.D. thesis [Dii07]*. This model concerns the recognition of the roles that the elements of any document can have. The goal of this recognition is to segment a document according to five particular constituents in order to reuse parts of it in different contexts. Every constituent - also called dimension - represents a point of view on the document that we analyze.

We can identify as content all the information written by the author of the document. For example, consider a common newspaper article: in this case the content of the document is the article itself, leaving out all the typographical elements such as the font family, the font size, et cetera. These last kinds of elements belong to presentation, the dimension that concerns what the document looks like. Font attributes, title layout, content placement, spaces between parts of the document, additional information not related to content: all these items concern presentation. Presentation concerns only the layout of elements; it does not take care of the logical organization of content: presentation lays out paragraphs, titles and images, but it does not identify what role they have in the document. This distinction is what the structure dimension refers to. We ordinarily use structures - such as paragraphs, containers, headers, inline elements - to arrange content. The goal of this dimension is to identify what these structures are, leaving out all hierarchical relations among them.

Content, presentation and structure are not the only points of view on a document. There is also information about the document itself. In a newspaper article some items, such as the heading, are not only content of the document but also define particular relations between themselves and the document. Consider the content A that represents the author of the article. This particular content does not only belong to the content dimension: it also defines a relation between itself and the document, namely that A is the author of the document. This kind of relation is what the metadata dimension refers to. While the previous four dimensions - content, presentation, structure and metadata - can be applied to any document [GM02]*, the last one - behaviour - is specific to digital documents. It identifies all items specifying interaction or dynamism for the document or its parts, such as links or scripts for web pages. It is important to understand that even if these dimensions are completely different, they are also connected. For example, an element that structurally is a picture can be treated as content of the document as well; the content of an element “h1” in a web document can be the title - from the point of view of the metadata - of the document itself; and so on.

The Pentaformat model is the model that we have used to develop a segmentation tool for XML documents. We have chosen it because we think it is the best model to segment this kind of document, such as (X)HTML documents, which are often characterized by all five dimensions. We perform the segmentation using a rule-based mechanism to identify which dimensions are associated with the elements of a document. The reason for using a rule-based tool for this process is the following: if we have rules that segment any (X)HTML document, and the structure of the language changes in a future version, we can rewrite our rules according to the new language definition without any change to the tool core.
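The rule-based idea described above can be pictured in a few lines of code. The following Python fragment is a minimal sketch, not the actual elISA 2.0 implementation: the rule table mapping (X)HTML element names to Pentaformat dimensions is entirely hypothetical, and the real rules live in a separate XML document with a much richer grammar.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: (X)HTML element name -> Pentaformat dimensions.
# Changing the language definition only means changing this table,
# not the segmentation logic itself.
RULES = {
    "p": {"structure", "content"},
    "h1": {"structure", "content", "metadata"},  # often the document title
    "b": {"presentation"},
    "a": {"behaviour", "content"},
    "meta": {"metadata"},
}

def segment(xhtml: str) -> dict:
    """Associate each element name with zero or more dimensions."""
    root = ET.fromstring(xhtml)
    result = {}
    for elem in root.iter():
        name = elem.tag.split("}")[-1]  # drop a namespace prefix, if any
        result.setdefault(name, set()).update(RULES.get(name, set()))
    return result

doc = "<html><body><h1>Title</h1><p>Some <b>text</b>.</p></body></html>"
dims = segment(doc)
```

Elements not covered by any rule (here, “html” and “body”) simply end up with no dimension attached, which is also useful information for the later patterning step.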

The segmentation mechanism we have developed, which represents the contribution of this work, is one of the tools used in ISA* [Dii07]*. As we can see in Picture 1, ISA* is an architecture - developed at the University of Bologna and applied in scenarios ranging from web editing, to e-learning, to book printing - that offers a general-purpose model to structure frameworks based on document transformation and analysis. This architecture takes a digital document as input and segments it according to the Pentaformat model. This segmentation can be used by an application logic that exploits the five dimensions, or a subset of them, for example to convert the input document to another format or to reformat it using another presentation.

The tool developed for segmenting documents can be used in this architecture for the phases concerning pre-parsing (the generation of well-formed XML), post-parsing (adding or removing features) and content analysis (document segmentation). One of the frameworks developed at the University of Bologna that uses the ISA* architecture is ISAWiki [DV04]*. It implements the concept of global editability for web pages on the model of Ted Nelson's Xanadu project [Nel80]*. To achieve this goal, ISAWiki includes a client application and a server application that let registered users edit any web page and store it in an appropriate server. In order to identify which parts of a web document users can modify, this platform uses an engine called elISA (Extraction of Layout Information via Structural Analysis) [DVV04]* to segment all web documents according to two main dimensions: content and (a small set of) presentation. The main goal of this engine is to extract content from any web document in order to convert it into an IML (ISAWiki Markup Language) document [San06]*: this is the format used by ISAWiki to store documents. The planned new version of ISAWiki will take into consideration document segmentation according to all the Pentaformat dimensions. Our segmentation tool, called elISA 2.0, is what this new ISAWiki version can use to perform this five-dimensional segmentation. In this context, a multi-dimensional point of view makes it possible to use any combination of these dimensions to perform operations on the input document such as document conversion, presentation replacement, content filtering and so on.

Picture 1 - The ISA* architecture

To achieve the main goal of ISAWiki - global editability - we need to convert automatically the segmented document returned by elISA 2.0 into an IML document. This is not an easy process because, besides the dimensional issue, there is another great difference between the output of elISA 2.0 and IML: the former, according to the Pentaformat model, does not force any hierarchical order for structures, while the latter complies with a structural pattern theory [DDD07]* in order to arrange content. This theory is based on seven patterns that we can use to structure any XML document: marker (an empty element whose meaning depends on its position or its existence), atom (an element that can contain text only), inline and block (elements that can contain text and repeatable inline/atom/marker elements), container (an element that can contain any element except inline), table (an element that contains homogeneous non-inline elements) and record (a sequence of optional but non-repeatable and non-inline elements). We therefore need to pattern the output of elISA 2.0 before converting it into an IML document. For this reason we have developed another rule-based engine, called the patterning engine, that can pattern any XML document using a set of patterning rules based on some patterning operations. The conjoined use of these two engines makes it possible to convert any web document into an IML document while preserving all the information related to the Pentaformat dimensions.
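As a rough illustration of the deductions that the pattern definitions above make possible, the sketch below guesses a pattern from an element's immediate content. It is an assumption-laden simplification of the theory in [DDD07]*, not the patterning engine itself (which, as said, is rule-based): in particular it lumps inline/block and container/table/record together, since telling those apart needs more context than one element provides.

```python
import xml.etree.ElementTree as ET

def guess_pattern(elem: ET.Element) -> str:
    """Crude pattern guess from an element's direct content (simplified)."""
    children = list(elem)
    # Text either directly inside the element or trailing a child element.
    has_text = bool((elem.text or "").strip()) or any(
        (c.tail or "").strip() for c in children
    )
    if not children and not has_text:
        return "marker"            # empty element
    if not children:
        return "atom"              # text only
    if has_text:
        return "inline or block"   # mixed content: text plus elements
    return "container-like"        # element-only content: container/table/record

p = ET.fromstring("<p>Some <em>mixed</em> content.</p>")
sec = ET.fromstring("<section><p>one</p><p>two</p></section>")
```

For instance, `p` above is guessed as “inline or block” (mixed content) while `sec` is “container-like” (element-only content).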

We can enlarge this context in order to illustrate the main field in which we have accomplished our work. The ISA* architecture is a general model that includes the analysis of digital documents in order to recognize the roles of their elements. This analysis concerns the extraction of data, with particular reference to content extraction [BBC07b]* from web documents. Before discussing this matter we must understand what content is in a digital document. Considering the Web context, we can give two intuitive definitions of content: it is what the author of a web document has written (leaving out all data added by automatic processes); or it is what users search googling. Using these intuitive definitions, we can introduce some examples of web documents in order to recognize in a visual way - for example, looking at a web page - whether an item is or is not content. Understanding whether a picture in a web page is related to content or describes only a presentational item is a significant example of this kind of recognition. The main point of several works - such as [LLY03]*, [CGG04]* and [AHR01]* - is to understand what the content of a web document is, leaving out all the remaining non-content items. We agree that content is the most important part of a document; however, we think the recognition of the roles of the other, non-content elements is important too. What these works lack is compliance with a multi-dimensional model, such as the Pentaformat, for the recognition of the roles of all the elements of a web document.

The last argument is another reason why we have developed elISA 2.0 according to the Pentaformat model. In order to use this model to segment web documents we have developed a new language to make declarations about the elements of XML documents: the Pentaformat Markup Language, or PML. A PML declaration is formed by four main items:

  • the Pentaformat dimension that characterizes the declaration;
  • a name that describes the chosen dimension;
  • a reference to the element the declaration refers to;
  • the content, i.e., the value of the declaration.
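The four items above can be pictured with a hypothetical declaration. Note that the element and attribute names below (`declaration`, `dimension`, `name`, `ref`) are illustrative placeholders only, not the actual PML syntax, which is introduced in Chapter 3.

```python
import xml.etree.ElementTree as ET

# A hypothetical PML-like declaration carrying the four items:
# a dimension, a descriptive name, a reference to an element, and a value.
decl_xml = """
<declaration dimension="metadata" name="author" ref="//meta[@name='author']">
    Lewis Carroll
</declaration>
"""

decl = ET.fromstring(decl_xml)
dimension = decl.get("dimension")   # which Pentaformat dimension
name = decl.get("name")             # what the dimension describes
value = decl.text.strip()           # the content of the declaration
```

The point of the model is that such declarations sit alongside the original markup, so the same element can carry declarations for several dimensions at once.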

The output of elISA 2.0 is a PML document that is identical to the input document but carries some PML declarations specified through qualified elements. Applying the elISA 2.0 output to the patterning engine we obtain a patterned PML document that we can easily convert into an IML document. This whole conversion process is performed by a web application called elISA Server Side, which we have developed to allow this conversion from any browser. This web application represents the final product of this work: it implements the goal that we have brought forward at the beginning of this chapter.

The rest of the thesis is structured as follows. In Chapter 2 we will discuss some content extraction techniques, introducing the matter of content and illustrating some works related to this issue. In Chapter 3 we will introduce our language (PML), used to segment XML documents according to the Pentaformat model; in addition, we will illustrate the main features of the two engines we developed (elISA 2.0 and the patterning engine) to convert a web page into an IML document. In Chapter 4 we will deepen the architectural description of these engines and introduce the web application that uses them to perform the conversion process from a web page to IML. Final remarks and ideas for future work are in Chapter 5.

2 On the way to Content Extraction

In order to understand the technological context of our work we need to clarify some concepts concerning the extraction of data from digital documents. In this chapter we introduce content extraction [BBC07b]* from web documents as the main technological context in which this thesis has been developed. Content extraction means identifying the relevant content of a web document while excluding all other data. The first issue is to understand what content is. We can describe it using two different definitions: content is what the author of a document has written; or it is what users search googling. As we will see in depth in Section 2.1, in most cases these two definitions coincide.

Another issue is understanding when images or tables are properly content and when we can regard them as presentation. This point involves a markup analysis of the tags used to insert an image or a table in a web document, but also a pattern recognition analysis [KT06]* of the documents in order to understand whether an image is a logo, a banner or a picture related to the content, and whether a table is genuine or presentational [Bag04]*. We will deepen this matter in Section 2.2.

As we know, a web document is not formed by content and presentational elements only. There are also other items, such as metadata [Nis04]*, that play important roles in the Web context. Introduced by a specific element, such as the tag “meta” for (X)HTML [JLR99]*, or by a standard language (RDF [BM04]*, OWL [BDH04]*, RDFa [AB07]*, microformats [All07]*), they are used to solve some tasks more easily, for example helping users to find what they want on the Web. All these technologies represent the basis of the Semantic Web [BHL01]*: an ambitious project promoted by the W3C to express web content in a format that can be read and used by automated tools. In Section 2.3 we discuss these issues in depth.

The last topic that we want to discuss concerns a selection of articles about methods for data extraction, with particular reference to the extraction of content from web documents. We will give a brief explanation of them in order to introduce a work carried out by Gottron [Got07]*, in which he analyzes the performance of each method. We introduce these issues in Section 2.4.

In the conclusions of this chapter (Section 2.5) we will briefly re-introduce all the subjects concerning the problem of content extraction and we will emphasize the shortcomings of the existing solutions in dealing with it. This will justify our approach to the problem.

2.1 What users want (or What is content?)

First of all, to understand what we mean by “content extraction” we must explain what the content of a web document is. We can try to describe it using two different definitions:

  1. content is what an author of a document has written;
  2. content is what users search googling.

To understand the first definition we can think about an article of a web newspaper such as The New York Times. In this case we can easily identify the content: if the contents of a newspaper are articles, then the content of an article is the article itself. But in a web newspaper, and consequently in all its articles, there are presentational aspects that do not belong to the content of the article and are inserted by automatic processes. We can take a look at Picture 2 and think about the reasons that motivated users to read that article. Probably they got into it from the home page because they were captured by the headline, or they were interested in all the articles written by Michael M. Gordon and Stephen Farrell (the authors of that article). But readers mainly click on an article link for one reason: they want to read what the author has written about the topic. Nothing else. So they know unconsciously what the content of the article is, because it is what they want. When readers read the article there is only the content: the text of the article and all the images or other items related to it. Everything else - any menu, banner, logo, advertising, video, et cetera - is not related to the article: it is presentation.

In the context of search engines we have defined content as what people search. This definition is quite accurate because search engines usually perform their searching on some pieces of content, not all of it. They perform a sort of content extraction during the indexing process [BP00]*, in which they collect the most relevant parts of each document. This kind of extraction concerns a few but meaningful parts of the content of a web document, which we call information [Flo05]*. Considering a datum as “a putative fact regarding some differences or lacks of uniformity within some contexts”, Floridi defines information D through a tripartite definition:

  • D consists of one or more data;
  • the data in D are well-formed, i.e. all data are clustered together correctly, according to the rules (syntax) that govern the chosen system, code or language being analysed;
  • the well-formed data in D are meaningful.

Picture 2 - “Iraq Lacks Plan on the Return of Refugees, Military Says” from The New York Times

According to this definition, we call information extraction [BBC07b]* a process that automatically extracts data having a pragmatic meaning for a certain domain. This kind of extraction is what the indexing process of search engines performs.

Picture 3 - Google results for “iraqi” query

Then, when users look for some content using specific keywords, search engines look for relevant information about these keywords and return some plausible results. Obviously it is not guaranteed that the results returned by a search engine are what users want. For example, suppose that a user wants to find the article in Picture 2 remembering only the word “Iraqi” and the website.

The Google search engine returns this article as the third result, as we can see in Picture 3. With one simple keyword, it finds the content that the (imaginary) user wants. Not all search engines return exactly the same results. For example, in Picture 4 we show the results of the Yahoo search engine for the same query. In this case all the returned results concern the word “iraqi”, so they refer to the domain that the user wants, but the desired result is not among them. Then the user, if he wants to find the article come hell or high water, can perform another search or use one of the suggestions proposed by the search engine, such as “iraqi flag”.

Picture 4 - Yahoo results for “iraqi” query

The same result as Yahoo is returned by Microsoft Live Search, with a little difference. As we can see in Picture 5, the only suggestion from the search engine corrects the word “iraqi” to “iraq”. Though the two words are similar, “iraq” is not what the user wants.

Regardless of the results, all these search engines are able to identify the correct context for the queries by basing their assumptions on the content of web documents: this is the important point.

We have just explained what the word “content” means. But sometimes it is not simple to decide whether particular elements of (X)HTML documents, for example images or tables, are content or not, because they can be used for many purposes. In the next section (Section 2.2) we try to explain when these elements refer to content and when they refer to presentation.

Picture 5 - Microsoft Live Search results for “iraqi” query

2.2 What about images and tables

In Section 2.1 we have explained what content is and why it is so important in the context of web applications such as search engines: it is “what users want”. In this section we keep discussing content, but we focus on the items of a web document that are not text, for example images and tables.

As we know, not all the images of a web document are properly content. As we can see in Picture 6, there are (at least) two images: the first one (top-left) is the logo of the website; the second one (middle-right) is a picture of the subject of the article, Tim Berners-Lee. Are both content? Obviously the answer is no, because the logo is inserted by an automatic process (the wiki engine itself) while the picture has been specified by one of the authors of the article. For a human being the distinction is clear, but it is much harder to distinguish their roles using an automatic process.

Picture 6 - The article of Wikipedia about Tim Berners-Lee

There are two main approaches to identify the real role of an image of a web document:

  • analyzing all the metadata of the tag “img” related to any specific image, also considering its location in the source of the document;
  • applying some pattern recognition algorithms that try to disambiguate the content of the image in order to understand its role.

The first approach works well if and only if there is enough context around the image. For example, if an image is inside an element classified as content of the article, such as a “div” with a “class” attribute set to “bodyContent”, it is more likely to be content. On the contrary, if an image is in the first 20% of the structure of a web document [DGK02]*, it is probably the logo of the website.
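A toy version of this first, context-based approach might look as follows. The 20% threshold follows [DGK02]* as cited above, while the class-name check is an illustrative assumption modelled on the “bodyContent” example; a real classifier would use many more cues.

```python
def guess_image_role(position_fraction: float, ancestor_classes: list[str]) -> str:
    """Heuristic guess at an image's role from its document context.

    position_fraction: where the "img" tag occurs in the source (0.0 = start).
    ancestor_classes: "class" attribute values of its ancestor elements.
    """
    # An ancestor marked as content (e.g. <div class="bodyContent">)
    # strongly suggests the image belongs to the article itself.
    if any("content" in c.lower() for c in ancestor_classes):
        return "content"
    # Images very early in the document structure are often logos [DGK02].
    if position_fraction < 0.2:
        return "logo or banner"
    return "unknown"

role1 = guess_image_role(0.55, ["bodyContent"])
role2 = guess_image_role(0.05, ["header"])
```

With these rules the Wikipedia logo of Picture 6 (early in the source, outside the content area) would be tagged “logo or banner”, while the picture of Tim Berners-Lee (inside the content area) would be tagged “content”.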

The second approach is based on pattern recognition theory [KT06]*. As Koutroumbas and Theodoridis say, “pattern recognition is the scientific discipline whose goal is the classification of objects into a number of categories or classes”. A specific application of this discipline concerns the classification (and thus the disambiguation) of images. On the basis of this theory, Choochaiwattana et al [CNS07]* suggest a heuristic approach to classify all the images of any web page according to four categories: human images, icons, banners, scenic images.

Mixing both these approaches we can build a program that tries to distinguish automatically whether an image of a web document is content or not. Probably, in most cases, it will work well. The problem is that in the Web 2.0 era we want to classify not only images but also other multimedia objects such as animations or videos. Obviously their classification is more difficult than image classification because they are not static. To deepen this argument see [EFH02]* and [MRS02]*.

A similar disambiguation issue concerns the use of tables [JLR99]*. In most cases the element “table” of (X)HTML is used by web designers to arrange the layout of a web page as well as to display tabular data. Probably this dual use has been triggered by a weak definition of the element itself: “the HTML table model allows authors to arrange data - text, preformatted text, images, links, forms, form fields, other tables, etc. - into rows and columns of cells”. We want to point out a specific matter: the authors of this definition used the word “data”, and with this word we can identify not only the content but also all the presentational elements of a web document. So using a table for layout is allowed by the definition.

However, the same authors have specified in another document [CJV99]* that “tables should be used to mark up truly tabular information” and not “to lay out pages”. One reason to avoid layout tables is related to people who surf the Web with a screen reader. Users of screen readers are probably not interested in any presentational element; in this case the sentence “content is what users want” is even more true.

In this context it is useful to have an automatic mechanism to identify all the layout tables of a web document. From another point of view, identifying layout tables means understanding which tables are properly content and which are not. A possible approach to this issue has been suggested by Vitali et al [DVV04]* and Bagnasco [Bag04]*, using a rule-based solution that tries to identify whether a table is genuine (a data table) or not (a layout table).
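In the spirit of that rule-based solution, a couple of such rules might be sketched as follows. The specific cues used here (presence of “th” header cells, nested tables, number of rows) are common heuristics for this task, not the exact rules of [DVV04]* or [Bag04]*.

```python
import xml.etree.ElementTree as ET

def is_genuine_table(table_xml: str) -> bool:
    """Guess whether a table is genuine (data) or presentational (layout)."""
    table = ET.fromstring(table_xml)
    has_header = table.find(".//th") is not None   # header cells suggest data
    nested = table.find(".//table") is not None    # nested tables suggest layout
    rows = table.findall(".//tr")
    return has_header and not nested and len(rows) > 1

data = "<table><tr><th>Year</th></tr><tr><td>2008</td></tr></table>"
layout = "<table><tr><td><table><tr><td>menu</td></tr></table></td></tr></table>"
```

A full rule set would of course weigh many more cues (cell sizes, “summary” attributes, the ratio of markup to text), but the shape of the solution - a table of testable conditions - is the same.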

Having discussed what content is and how it concerns not only text but also structured elements such as images and tables, we want to point out that a web document is formed not only by content and presentation but by structural items and metadata too. We will deepen this issue in Section 2.3.

2.3 There is not content only

In Section 2.1 and Section 2.2 we have analyzed what content is and which elements of a web document it refers to. Specifically, our discussion concerned the identification of the real role of the elements of a web document in general - such as text, images and tables - in order to understand whether a particular element can be considered content or presentation. As an example, think about the difference between data tables and layout tables.

We think that the distinction between these two roles is not enough to make a good segmentation of a web document. For example, let us consider metadata [Nis04]*. By a general definition, metadata are data about data, i.e. pieces of information that describe, explain or locate another information resource. We use them every day in every context. Think about a book such as “Alice's Adventures in Wonderland” [Car65]*. In the text we distinguish two kinds of information: information contained in the book - the content - and information about the book - the metadata. For example, the name “Lewis Carroll” is part of the contents of the book (it is on the book cover) but we also consider it as the author of the book itself. This information about a name - “the author of the book Alice's Adventures in Wonderland is Lewis Carroll” - belongs to the metadata set related to the book. Other metadata can be the title, the edition, the publishing house, the release date and so on.

In an (X)HTML document we can define metadata about it using the tag “meta” [JLR99]*. Specifying a property (through the attribute “name”) and a value (through the attribute “content”) we can make a simple metadata declaration about the document. For example, look at Code 1. In this case the sentence “the author of Alice's Adventures in Wonderland is Lewis Carroll” is defined using the simple metadata declaration <meta name="author" content="Lewis Carroll" />.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Alice's Adventures in Wonderland</title>
        <meta name="author" content="Lewis Carroll" />
    </head>
    <body>
        <h1>Alice's Adventures in Wonderland</h1>
        <h2>Chapter 1: Down the Rabbit Hole</h2>
        <p>
            Alice was beginning to get very tired of sitting by her sister 
            on the bank, and of having nothing to do: once or twice she had 
            peeped into the book her sister was reading, but it had no 
            pictures or conversations in it, <q>and what is the use of a 
            book,</q> thought Alice, <q>without pictures or conversation?</q>
        </p>
        <p>
            So she was considering, in her own mind (as well as she could,
            for the hot day made her feel very sleepy and stupid), whether
            the pleasure of making a daisy-chain would be worth the trouble
            of getting up and picking the daisies, when suddenly a White 
            Rabbit with pink eyes ran close by her.
        </p>
        
        [...]
        
    </body>
</html>

Code 1 - “Alice's Adventures in Wonderland” in an XHTML document

As we know, these metadata are not shown directly in a web page, so a user reading the text cannot tell whether any metadata are specified. So why do we use metadata if users do not view them? The answer is simple: to help machines retrieve information [Rij79]* about a document. Metadata are very important to search engines, for example: they use them to answer queries in the best possible way, since evaluating metadata gives search engines more meaningful information with which to work out their results. For this reason we can reformulate the previous answer: we use metadata in order to allow machines to help users.

In the context of web documents there are other languages that allow the definition of metadata or relations about something. The need to define relations between elements is so important that Tim Berners-Lee et al. [BHL01]* have postulated a sort of new version of the Web based on the use of such relations. The result is known as the Semantic Web, which “provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries”. This framework was originally based on two main languages: the Resource Description Framework (RDF) [BM04]* and the Web Ontology Language (OWL) [BDH04]*.

RDF is a language, commonly serialized in XML, that allows relations among resources to be expressed through subject-predicate-object triples. A triple describes a directed graph in which the subject and the object are the nodes and the predicate is the arc connecting the subject to the object. For example, we can read the previous sentence “the author of Alice's Adventures in Wonderland is Lewis Carroll” as “Lewis Carroll” (subject) “is the author of” (predicate) “Alice's Adventures in Wonderland” (object). We use RDF to represent information on the Web.
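As a sketch, this triple could be serialized in RDF/XML, here expressed in the equivalent direction “the book has creator Lewis Carroll”. The book URI and the use of the Dublin Core “creator” property are assumptions of this example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- Subject: the book; predicate: dc:creator; object: "Lewis Carroll" -->
    <rdf:Description rdf:about="http://example.org/books/alice">
        <dc:creator>Lewis Carroll</dc:creator>
    </rdf:Description>
</rdf:RDF>
```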

OWL is a language that extends RDF in order to define ontologies [Gru92]*. An ontology is an explicit specification of a conceptualization, i.e. a description of the concepts and the relationships that can hold between classes of items. For example, suppose we have three classes: “Dogs”, “Cats” and “Animals”. Many relations can exist among these classes: “Dogs” and “Cats” are subclasses of “Animals”, “Dogs” hate “Cats”, “Cats” are slier than “Dogs”, and so on. All these relations are expressed through RDF triples or through appropriate OWL constructs.
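The subclass relations of this example can be sketched in OWL (RDF/XML serialization); the class URIs are illustrative only:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
    <owl:Class rdf:about="http://example.org/zoo#Animal"/>
    <!-- "Dogs" and "Cats" are declared subclasses of "Animals" -->
    <owl:Class rdf:about="http://example.org/zoo#Dog">
        <rdfs:subClassOf rdf:resource="http://example.org/zoo#Animal"/>
    </owl:Class>
    <owl:Class rdf:about="http://example.org/zoo#Cat">
        <rdfs:subClassOf rdf:resource="http://example.org/zoo#Animal"/>
    </owl:Class>
</rdf:RDF>
```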

The ambitious, still-in-progress goal of the Semantic Web is pursued one step at a time, as described by the “layer cake” in Picture 7. As a first step, users need to understand how and why they can use these technologies. This step is difficult because RDF and OWL are not as easy to understand as (X)HTML. A possible solution is to build programs, such as the Calais project by Reuters, that take an XML document and enrich it with semantic information.

Picture 7 - The Semantic Web Layer Cake (© W3C - CC-BY-3.0)

Another possible solution is to introduce users to easier technologies related to the Semantic Web, such as RDFa [AB07]* (based on RDF) or the microformats [All07]*. These are languages embedded in (X)HTML that allow relations among the items of a web document to be defined easily, using (X)HTML attributes such as “href”, “rel”, “property”, “content” or “class”.
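For instance, the authorship statement used earlier in this chapter could be embedded directly in the markup with RDFa; the Dublin Core prefix and the surrounding markup are assumptions of this sketch:

```xml
<!-- The "about" attribute sets the subject (the book); each "property"
     attribute attaches a predicate whose object is the element's text. -->
<p xmlns:dc="http://purl.org/dc/elements/1.1/"
   about="http://example.org/books/alice">
    <span property="dc:title">Alice's Adventures in Wonderland</span>
    was written by
    <span property="dc:creator">Lewis Carroll</span>.
</p>
```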

As we have seen, the extraction of metadata from web documents has had a great impact on its audience (users, companies, communication) in recent years. For this reason we clearly identify more than two roles when segmenting web documents - not everything on the Web is content or presentation - in order to account for other constituents such as metadata. Several studies have proposed different approaches to segmenting a web document based on the identification of content, presentation or other constituents. We will analyze some of them in Section 2.4.

2.4 How to extract data

In the previous sections we analyzed what content extraction is. To support our explanation we defined some concepts such as content, information and datum. We then looked at some examples of “what is content” in web documents in order to understand the difference between content and presentation, using as examples some typical items of a web document such as text, images and tables. But not all the elements of a web document concern content or presentation: other items, such as metadata, are important too because, through automated tools, they help users find “what they want”. In this section we introduce some works related to the extraction of data - in particular content extraction - in order to survey the possible approaches to this operation.

A first work [LLY03]* concerns the identification of noise in a common web document. Lan Yi et al. define two different types of noise: global noise, seen from a coarse-grained point of view, such as duplications of the same pages through mirrors, old versions of a page, et cetera; and local noise - the topic of their investigation - which concerns all the parts of a web document that are disconnected from the content, for example navigation menus, banners or ads. Their approach is based on the analysis of the Document Object Model (DOM) [BCL04]* of a web page under the following assumption: in all the pages of a web site, such as a commercial web site, there are some structures that never change (menus, logos, et cetera) and that tend to follow the same layout, because they are (often) generated automatically. Their goal is to identify these kinds of structures - the local noise of a web page - in order to filter them out and obtain the content alone.

Gupta et al. [CGG04]* introduce a tool, called Crunch, that uses a structural analysis based on the DOM of web pages in order to identify their content. This framework allows a web page to be processed through completely customizable filters. Moreover, it defines an application programming interface (API) for extending it with further filters and plugins.

Rahman et al. [AHR01]* combine structural analysis with a contextual analysis of the different zones of a web document in order to reformat its important content for devices with small screens, such as PDAs or cellular phones. They propose a five-step approach:

  1. analyze the structure of the web document;
  2. split the web document into sub-documents based on the analyzed structure;
  3. analyze each sub-document considering its specific context;
  4. summarize each sub-document in order to make a table of contents (TOC) for the original document that follows the original sub-document order;
  5. sort this TOC on the basis of the relative importance of each sub-document.

Another tool that identifies the content of a web document and some presentational elements - logos, banners, ads, navigation menus, et cetera - is elISA (Extraction of Layout Information via Structural Analysis) [DVV04]*. It was developed at the University of Bologna and is based on XSL Transformations (XSLT) [Cla99]*, a transformation language for XML documents. The engine uses an XML document containing a user-defined rule-set - based on the structure of web documents - and a meta-stylesheet to produce a new stylesheet that segments the input document. All the rules of the rule-set document are written in XPath 1.0 [CD99]*, a language for addressing parts of an XML document. In the context of XPath queries, empirical studies such as [AKK06]* suggest using relative XPath expressions to address the XML nodes of a document because they are more robust than absolute expressions. This robustness concerns sensitivity to changes in the structure of an XML document: Kowalkiewicz et al. observe that, if the structure of a document changes, a relative XPath expression is more likely to remain valid than an absolute one.
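The difference between the two addressing styles can be sketched with a pair of expressions (the element names and the “rule” wrapper are illustrative, not taken from the actual elISA rule-set):

```xml
<!-- Absolute: bound to one exact position in the tree; adding or
     removing any intermediate element invalidates it. -->
<rule match="/html/body/div[2]/div[1]/p/a"/>

<!-- Relative: keeps matching the same links after many structural
     changes, e.g. when the article is wrapped in a new container. -->
<rule match="//div[@class='article']//a"/>
```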

Building on some of these works and other investigations - for example [FKS01]*, [BCC02]* and [CMO05]* - Gottron [Got07]* has carried out, using his own framework, an interesting study on the performance of these content extraction methods. He concludes his article by suggesting Branstein et al.'s work (with some adaptations) as the best performing method.

The only aim of this section has been to introduce some working methods and tools for extracting data from web documents. The conclusions of this chapter (Section 2.5) will summarize all the issues related to data extraction in order to introduce our contributions on these topics.

2.5 So what?

In this chapter we have introduced the main context in which we work: the extraction of data from web documents. To explain it, in Section 2.1 we clarified what the content of a web document is and why it is so important. To understand the differences between content and presentation, in Section 2.2 we introduced some examples related to the possible roles that images and tables can assume. We then showed (Section 2.3) that web documents are not formed by content and presentational elements alone, but can also carry further information, such as metadata, which is very useful for improving the quality of search engine results. After this discussion about content, we introduced some methods and tools to extract data from web documents, focusing especially on content extraction, and we concluded the section (Section 2.4) by introducing an analysis [Got07]* of the performance of some of these methods.

All these methods concern the recognition of content, leaving out the analysis of the other, non-content elements of a document. This is their shortcoming: in most cases these methods analyze a document using a flat model - an element either is or is not content, and that is all. We think that the analysis of content is important, but identifying the roles of all the non-content elements is important too. For this reason, in Chapter 3 we will introduce our work on the segmentation of web documents, based on a five-dimensional model - called Pentaformat [Dii07]* - for segmenting any document, digital or not. We will introduce an implementation of this model for XML documents and describe two different engines that use this implementation to segment and transform web documents according to the Pentaformat model.

3 Pentaformat Markup Language and other stories

In the previous chapter (Chapter 2) we illustrated the technological context in which we have developed this thesis. We discussed data extraction and its related topics: what we mean by the content of web documents, why content extraction processes are so important for Web users, and what other constituents, such as presentation or metadata, we can identify to characterize the elements of a web document.

Considering this context and recalling the claim of our work - “we propose a rule-based mechanism in order to segment XML documents” (Chapter 1) - in this chapter we propose a language to specify the roles that each element of an XML document can have. Before introducing this language we introduce the model it complies with: the Pentaformat [Dii07]*. This model is used to segment any document - digital or not - according to five different but connected constituents, called dimensions: content, structure, presentation, behaviour and metadata. We will describe this model in depth in Section 3.1, introducing an example.

We will then introduce the first result of our thesis: the Pentaformat Markup Language, or PML. This is a language defined in Relax NG [Oas01]* that allows XML documents to be segmented according to the Pentaformat model. In Section 3.2 we will define the PML terminology and syntax and show a couple of examples.

Developing an engine capable of segmenting XML documents using PML is one of the main goals of this thesis. To this end we have rewritten Extraction of Layout Information via Structural Analysis (elISA) [DVV04]* - the same rule-based engine introduced in Section 2.4 - so that it can handle PML. The old version of elISA identifies the content and some presentational elements of a web document: it transforms the input document - through the application of a meta-XSLT document [Kay07]* with some rules - into a new XML document in which the content and the presentational elements are identified. The version we have developed (elISA 2.0) instead segments XML documents according to the Pentaformat model. By applying three different meta-XSLT documents with some rules, the input document is transformed into a new XML document with some dimensional declarations expressed through PML tags. We will introduce this new engine in depth in Section 3.3.

elISA - both the old and the new version - is not only an engine to segment XML documents; it is also an important module of an ambitious project developed at the University of Bologna: ISAWiki [DV04]*. This is a client/server platform, inspired by Ted Nelson's Xanadu project [Nel80]*, in which every registered user can create, modify or reuse any web page using a client-side editor called the ISAWiki editor. The editor uses elISA to identify the editable content of a web page. All pages created or modified through it are saved on an ISAWiki server in an intermediate language called ISAWiki Markup Language, or IML [San06]*, which stores the structured content of the document alone, leaving out the presentation. Our goal is to transform the output of elISA 2.0, a PML document, into an IML document. On the surface this operation seems easy, because we have to transform a document with five constituents into another with only two: the content and the structure. In fact it is not, because IML - unlike PML - has another important feature: it complies with seven structural patterns [Gub04]* for organizing the content. In order to produce a pattern-compliant PML document we have developed a specific language called PML patterns (PMLp), which allows an XML document to be rebuilt according to some patterning operations. Using another rule-based engine, we transform the input PML document into a patterned PML+PMLp document. The latter is easily transformable into an IML document through a simple meta-XSLT. We will discuss these issues in depth in Section 3.4.

At the end of this chapter (Section 3.5) we will briefly review all the matters concerning our work.

3.1 The Pentaformat model

As already mentioned in the introduction to this chapter, in order to understand the language that we have developed to segment XML documents, we must present the model our language complies with. The Pentaformat [Dii07]* is a model that can be used to segment any kind of document (not only digital ones). It also allows its data (or parts of them) to be reused in different contexts, such as adapting the layout of a specific web page to a “small-screen” device.

First of all we will describe the five constituents, called dimensions, that characterize the model. Even though they are distinct, these dimensions - content, structure, presentation, behaviour, metadata - are also connected. We will explain their characterization in Section 3.1.1.

After that, we will discuss why we use a five-dimensional model to segment documents. We will justify the choice of excluding a “content-structure-presentation” model [GM02]* for handling any document, and we will introduce the benefits a five-dimensional model brings to document segmentation. We will discuss these matters in depth in Section 3.1.2.

Lastly, to understand how we can use the Pentaformat model, we will introduce an example based on an article from the online edition of _The New York Times_ and analyze it through the five dimensions of the model. We will report this analysis in Section 3.1.3.

3.1.1 Five easy dimensions

To understand the Pentaformat model, and consequently our language to segment XML documents (which we will introduce in Section 3.2), we must describe all its constituents. We now introduce one by one the five dimensions illustrated in Picture 8.

Picture 8 - The Pentaformat model

We call content all the non-structured information written by the author of the document. Thinking of a classic (X)HTML document for an article of a web newspaper, such as “The New York Times”, we identify as content:

  • the text of the article;
  • the close-up image;
  • the small and clickable pictures related to the article.

The other parts, such as the main menu or all the elements added automatically by scripts, are not considered content. As we can see in Picture 9, we can associate with the content dimension the text of the article, the picture with a woman in the foreground and many children in the background, and the figure with a map of Baghdad. These items relate to what the authors have written. All the other elements - such as menus (the “Most popular” and printing-facilities menus), advertising images or videos (at the top and on the right of Picture 9), the internal search box and so on - do not belong to the content dimension.

The logical organization of the whole information of a document pertains to the structure. This dimension describes what kind of structure is used to contain a specific group of information, such as text, images, video, menus, etc. For example, in the first paragraph of the article in Picture 9 we can recognize two structures, as shown in Code 2: a “p” element and an “a” element, which represent a paragraph - the first paragraph - and a link respectively.

<p>
    BAGHDAD, Nov. 29 — As
    <a
        href="http://topics.nytimes.com/top/news/international/countriesandterritories/iraq/iraqi_refugees/index.html?inline=nyt-classifier"
        title="Recent and archival news about Iraqi refugees.">
            Iraqi refugees
    </a>
    begin to stream back to Baghdad, American military officials say the Iraqi government has yet to 
    develop a plan to absorb the influx and prevent it from setting off a new round of sectarian 
    violence.
</p>
Code 2 - Structures related to the first paragraph of Picture 9

Generally speaking, the presentation concerns how all the elements of a document look. This is not the whole story, because more than one layer of presentation exists in a document, and this is especially true of a digital document such as an (X)HTML document. The most obvious layer concerns the placement of the various (structured) elements that compose the document. Another layer concerns the typographical and presentational layout - colors, backgrounds, fonts, etc. - of the document. A third layer refers to all the elements that are not written by the author but are inserted into the document by some automatic process, e.g. all the contextual information we can see in any article of a web newspaper (the “Most popular” menu in Picture 9 is a good example of this kind of process) or all the dynamic ads so often visible on websites (e.g. those using the Google AdSense platform).

All the dynamic elements of a digital document, such as ads, banners, logos and so on, are not correlated with the presentation only. In particular, all the elements that have some sort of “dynamism” or any kind of interaction with users can be described by the behaviour dimension. From this point of view, a link to another document belongs to this dimension just as much as any script used to handle banners, any AJAX technology, or any interaction with the visitors of the site.

The last dimension of the Pentaformat model relates to any information about the document itself or parts of it. This meta information, called metadata, enriches the document with assertions about the author, the creation date, the title, and so on. In this manner we allow these metadata to be used by machines, intelligent agents and indexing processes. Meta information systems, such as the Dublin Core metadata system, represent a pillar of the ambitious W3C project known as the Semantic Web, as we can deduce from [BHL01]*.
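As a sketch of how such metadata can be attached to an (X)HTML document, the Dublin Core element set is conventionally embedded through “meta” elements with “DC.”-prefixed names (the values below are illustrative):

```xml
<head>
    <title>Alice's Adventures in Wonderland</title>
    <!-- Dublin Core assertions about the document itself -->
    <meta name="DC.title" content="Alice's Adventures in Wonderland" />
    <meta name="DC.creator" content="Lewis Carroll" />
    <meta name="DC.date" content="1865" />
</head>
```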

In this section we analyzed the five dimensions of the Pentaformat model, clarifying what they refer to. To answer the questions “Why do we use a five-dimensional model to segment a document?” and “Is a three-dimensional model not enough?” we have written a short explanatory section (Section 3.1.2).

3.1.2 The need for a five-dimensional segmentation

In Section 3.1.1 we explained what the dimensions of the Pentaformat model are and what they refer to. In this section we explain why Di Iorio has suggested a five-dimensional model [Dii07]* to treat the problem of document segmentation. After that, we illustrate the benefits of the Pentaformat model.

As [GM02]* suggests, to analyze a document such as a poster or a book we can use a three-layer model in which we distinguish three fundamental constituents:

  • the content, i.e. all the information carried by the document itself, which answers the question “what is it”;
  • the structure, i.e. where the content is located and in which structures it is contained, which answers the question “where is it”;
  • the presentation, i.e. how the structured content is shown, which answers the question “what does it look like”.

Generally these three constituents are enough to segment a non-digital document. With the beginning of the Semantic Web (microformats, RDFa, RDF, OWL) and with the advent of AJAX technologies (which enabled the birth of Web 2.0), metadata and dynamic interaction (or dynamic behaviour) have become fundamental keywords for today's digital documents, such as web pages. Thus, in order to segment any digital or non-digital document in the best way, without any loss, [Dii07]* has suggested a five-dimensional model that allows:

  • the reuse of some parts of a document for many purposes in different contexts;
  • the composition of different parts of different documents in order to easily create a new document from multiple sources;
  • the portability of a document, permitting platform-independent visualization.

The major benefit of this five-dimensional approach to segmenting documents is a sort of multiple but interconnected view of the same thing. To permit this view it is necessary to connect the five dimensions of the model without any hierarchy: the users define the hierarchy by structuring the document the way they prefer. For this reason the Pentaformat model permits seeing the same document from different points of view designed to work together. Combining these five different points of view we obtain a complete and sophisticated analysis of any document that allows an all-round reuse of its parts in multiple contexts.

Having discussed the benefits of the five-dimensional model described in Section 3.1.1, in Section 3.1.3 we will propose an example of segmentation of an article from the online edition of _The New York Times_ in order to illustrate how we can use the Pentaformat to segment documents.

Picture 10 - Content identification for Picture 9

3.1.3 The Pentaformat segmentation for (X)HTML documents: an example

After the introduction of the Pentaformat model in Section 3.1.1, which we have used to develop our language to segment XML documents, and after the explanation of the benefits of using this model (Section 3.1.2), in this section we present a simple and clear example based on an analysis of Picture 9. We consider all the dimensions one by one.

Identifying the content of the article is quite simple, because we base all our deductions on the question “What has the author written?”. On this basis we can identify the content as shown in Picture 10.

Moreover, the “title” attributes of the “a” elements contained in the article body have been written either by the author or by an automatic process. If we believe the first hypothesis, then the “title” attribute can be considered content; otherwise it cannot. As we can see in Code 3, it is very difficult to determine who has written the text of the attribute (probably the author, but this is not certain).

The structure identification is quite easy instead, because in (X)HTML code every tag is related to a particular structure. For this reason the element “p” is a paragraph, the element “a” is a link, the element “div” is a section or a divider, and so on. There can be cases in which the use of a particular tag is ambiguous, as we can see in Code 4. In this particular case the element “div” is not a section or a divider but has the structure of a paragraph.

<div class="credit">
    Michael Kamber for The New York Times
</div>
Code 4 - An element “div” that can be considered a paragraph

As we have seen above, the presentation of a document is characterized by a multi-layer segmentation. In this case, the placement of all the elements and all the typographical settings are specified by some Cascading Style Sheets declarations [BCH07]* using the tags “link” and “style”, as we can see in Code 5.

<link
    rel="stylesheet" 
    type="text/css" 
    href="http://graphics8.nytimes.com/css/common/global.css" />

<style type="text/css">
    @import url(http://graphics8.nytimes.com/css/common/screen/article.css);
</style>
Code 5 - Use of CSS in Picture 9

The other presentational layer - covering any text, image, video, etc. that has not been written by the author - is identified in Picture 11.

Picture 11 - Some presentational entities for Picture 9

In the web page context, all the elements that allow any interaction with users or that work dynamically on parts of the document belong to the behaviour dimension. As we can see in Picture 12, search engine boxes, links, videos, animations, scripts and all the items related to AJAX technologies are good examples of the behaviour dimension.

Picture 12 - Dynamic behaviour in Picture 9
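A minimal sketch of this kind of dynamism is a script that rotates a banner image; the element id, file names and interval are illustrative assumptions:

```xml
<!-- Hypothetical behaviour element: swaps the banner image every
     five seconds without any action by the reader. -->
<script type="text/javascript">
    var banners = ["ad1.gif", "ad2.gif"];
    var current = 0;
    function rotateBanner() {
        current = (current + 1) % banners.length;
        document.getElementById("banner").src = banners[current];
    }
    setInterval(rotateBanner, 5000);
</script>
```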

Last but not least, there is the metadata dimension. In an (X)HTML document there are many ways to define metadata, from “meta” elements - as we can see in Code 6 - to Semantic Web technologies such as microformats, RDFa, RDF and OWL.

<meta  
    name="description" 
    content="U.S. military officials said the Iraqi government had yet to develop a plan 
             to absorb returning refugees and keep them from setting off a new round of violence.">
<meta  
    name="keywords" 
    content="Iraq,Immigration and Refugees,United States International Relations,
             Sunni Muslims,Shiite Muslims">
<meta  
    http-equiv="Content-Type" 
    content="text/html; charset=iso-8859-1">
<meta  
    name="geo" 
    content="Iraq">
<meta  
    name="dat" 
    content="November 30, 2007">
<meta  
    name="tom" 
    content="News">
Code 6 - A part of “meta” elements in Picture 9

Not all metadata are described by elements such as “meta”. There are also hidden metadata (we say “hidden” because they are not defined through an explicit element) that refer to a specific element and not only to the document itself. In this context, every “src” attribute is a metadatum of its “img” element, every “title” attribute is a metadatum of its related element, and so on. In addition, some metadata can be hidden in any part of the document. As we can see in Code 7, the text nodes of the “a” elements represent the authors of the article. They are metadata of the document, but they are not marked with a particular element: they are hidden in the text.

<div class="byline">
    By
    <a
        href="http://topics.nytimes.com/top/reference/timestopics/people/g/michael_r_gordon/index.html?inline=nyt-per"
        title="More Articles by Michael R. Gordon">
            MICHAEL R. GORDON
    </a>
    and
    <a
        href="http://topics.nytimes.com/top/reference/timestopics/people/f/stephen_farrell/index.html?inline=nyt-per"
        title="More Articles by Stephen Farrell">
            STEPHEN FARRELL
    </a>
</div>
Code 7 - Hidden metadata in the article body of Picture 9

In this section we have presented an example of the Pentaformat model in order to understand how we can segment documents such as web pages. This example and the introduction of the model (Section 3.1.1 and Section 3.1.2) are necessary to describe our language, called Pentaformat Markup Language or PML, which allows XML documents to be segmented according to the Pentaformat model. As mentioned in the introduction to Chapter 3, this language is used in elISA 2.0 to segment XML documents. In Section 3.2 we will analyze the terminology and syntax of PML and use a couple of examples to understand how to segment XML documents.

3.2 Pentaformat Markup Language

As we have seen in Section 3.1, the Pentaformat [Dii07]* is a good model for segmenting documents, such as web documents, by identifying the content, the presentation and the other dimensions (structure, behaviour and metadata). This segmentation can be useful in the context of data extraction introduced in Chapter 2.

In this section we focus on the Pentaformat Markup Language (PML), the language that we have developed to segment XML documents according to the Pentaformat model. Technically speaking, PML is a “parasite” language: we use it to indicate whether some elements of an existing document, written in another XML language (called the host), belong to one or more Pentaformat dimensions. We use this language to extend the old version of elISA [DVV04]* in order to segment documents according to the Pentaformat model.

In the following sections we introduce the PML terminology and XML syntax (Section 3.2.1) and show how to use it through a couple of examples (Section 3.2.2).

3.2.1 Terminology and syntax

In this section we present the PML syntax that we use to perform document segmentation according to the Pentaformat model, introducing some terminology for our language. The most important thing to understand is how to make a pml declaration, i.e. a declaration in which we identify the Pentaformat dimension associated with some elements of an XML document. A pml declaration is formed by the following four elements:

  • dimension is the Pentaformat dimension that we consider;
  • name is a specific type related to the current dimension. The admissible values are specific to each dimension; all of them must be strings of alphabetical characters, without numbers or spaces;
  • ref represents the items related to the current dimension;
  • content represents the object of the declaration.

A pml declaration can be written as reported in Code 8. In this definition “ref” and “content” are two XPath 2.0 [BBC07a]* queries: for the former the context node is the document root, while for the latter the context node is the “ref” sequence.

<pml
    [dimension]
    [name]
    [ref]
    [content]
>
Code 8 - Pml declaration

Every pml declaration has a particular name depending on its “ref” and “content” values. If both values refer to a one-item sequence we call it an atomic pml declaration, otherwise we call it a complex pml declaration. We illustrate this first simple difference using the example in Code 9.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Extract from Alice's Adventures in Wonderland</title>
    </head>
    <body>
        <h1>
            Extract from <em>Alice's Adventures in Wonderland</em> 
        </h1>
        <p>
            Alice was beginning to get very tired of sitting by her sister 
            on the bank, and of having nothing to do: once or twice she had 
            peeped into the book her sister was reading, but it had no 
            pictures or conversations in it, <q>and what is the use of a 
            book,</q> thought Alice, <q>without pictures or conversation?</q>
        </p>
        <p>
            So she was considering, in her own mind (as well as she could,
            for the hot day made her feel very sleepy and stupid), whether
            the pleasure of making a daisy-chain would be worth the trouble
            of getting up and picking the daisies, when suddenly a White 
            Rabbit with pink eyes ran close by her.
        </p>
    </body>
</html>
Code 9 - An extract from Alice's Adventures in Wonderland

Some examples of these pml declarations are reported in Code 10.

<pml
    content
    Text
    //(p|h1|q)
    text()
>

<pml
    content
    Text
    //em
    text()
>
Code 10 - Two declarations for Code 9

The first declaration in Code 10 is labelled complex because there is more than one item in the “ref” sequence and, sometimes, more than one item in the “content” sequence. On the other hand, the second declaration in Code 10 is labelled atomic because both the “ref” and the “content” sequences are formed by one item only.
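
As a quick illustrative check, the atomic/complex distinction can be simulated in Python. This is only a sketch: the XPath 2.0 queries of Code 10 are approximated with ElementTree lookups (which do not support union paths, so the “ref” query is given as a list of paths), and the “content” sequence is approximated by the text content of each ref node.

```python
import xml.etree.ElementTree as ET

# A namespace-free reduction of the document in Code 9.
DOC = """<html><body>
  <h1>Extract from <em>Alice's Adventures in Wonderland</em></h1>
  <p>Alice was beginning to get very tired of sitting by her sister.</p>
</body></html>"""

root = ET.fromstring(DOC)

def classify(ref_paths):
    """'atomic' when both the ref and the content sequences hold exactly
    one item, 'complex' otherwise."""
    ref_nodes = [n for p in ref_paths for n in root.findall(p)]
    content = [t for n in ref_nodes for t in n.itertext() if t.strip()]
    return "atomic" if len(ref_nodes) == 1 and len(content) == 1 else "complex"

print(classify([".//em"]))          # a single <em> with a single text node
print(classify([".//p", ".//h1"]))  # several nodes in the ref sequence
```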

A pml sequence is the sequence produced by the “([ref])/([content])” query; it identifies which document items are objects of the pml set (the set of all declarations related to the document). Considering the example in Code 10, the pml sequence for the first pml declaration is (//(p|h1|q))/(text()). The sequence of all the pml sequences related to a specific dimension is called a dimension sequence. An example of this kind of sequence, referred to the content dimension, is shown in Code 11.

(
    ((//(p|h1|q))/(text())),
    ((//em)/(text()))
)
Code 11 - The content sequence for the pml declarations in Code 10

In order to associate these pml declarations with an XML document, such as Code 9, we have defined a small grammar that permits the insertion of a little group of elements and attributes into any XML document. All the information about the dimensions is introduced by a specific, qualified “dimensions” element. A general rule specifies that there must be at most one “dimensions” element in the document (its position is irrelevant). This element contains all the pml declarations, specified by five qualified elements - “content”, “structure”, “presentation”, “metadata” and “behaviour” - each of which has three qualified attributes: “name”, “ref” and “content”. A dimension element and its three attributes represent a pml declaration, as we can see in Code 12.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:pml="http://www.essepuntato.it/PML">
    <pml:dimensions>
        <pml:content
            pml:name="Text"
            pml:ref="//(p|h1|q)"
            pml:content="text()"
        />
        <pml:content
            pml:name="Text"
            pml:ref="//em"
            pml:content="text()"
        />
    </pml:dimensions>
    <head>
        <title>Extract from Alice's Adventures in Wonderland</title>
    </head>
    <body>
        <h1>
            Extract from <em>Alice's Adventures in Wonderland</em> 
        </h1>
        <p>
            Alice was beginning to get very tired of sitting by her sister 
            on the bank, and of having nothing to do: once or twice she had 
            peeped into the book her sister was reading, but it had no 
            pictures or conversations in it, <q>and what is the use of a 
            book,</q> thought Alice, <q>without pictures or conversation?</q>
        </p>
        <p>
            So she was considering, in her own mind (as well as she could,
            for the hot day made her feel very sleepy and stupid), whether
            the pleasure of making a daisy-chain would be worth the trouble
            of getting up and picking the daisies, when suddenly a White 
            Rabbit with pink eyes ran close by her.
        </p>
    </body>
</html>
Code 12 - An extract from Alice's Adventures in Wonderland with the pml declarations specified in Code 10

In conclusion we introduce the two following auxiliary items of the PML language:

  • a qualified attribute called “pid” (PML id) that can be used on any element of the original document;
  • a qualified element called “stone”, with an optional “pid” attribute, that can contain any type of node.

In Section 3.2.2 we will introduce some examples to understand how to segment an XML document using these elements.

3.2.2 Examples

In order to understand how we can use all the elements presented in Section 3.2.1, we illustrate two simple segmentation examples. We take into consideration the example in Code 13, which extends the example in Code 9 with a complete pml set.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:pml="http://www.essepuntato.it/PML">
    <pml:dimensions>
        <pml:content pml:name="Text" pml:ref="//body//element()" pml:content="text()" />
        
        <pml:structure pml:name="Root" pml:ref="/html" pml:content="." />
        <pml:structure pml:name="Head" pml:ref="//head" pml:content="." />
        <pml:structure pml:name="Title" pml:ref="//title" pml:content="." />
        <pml:structure pml:name="Body" pml:ref="//body" pml:content="." />
        <pml:structure pml:name="Paragraph" pml:ref="//p" pml:content="." />
        <pml:structure pml:name="Heading" pml:ref="//h1" pml:content="." />
        <pml:structure pml:name="Emphasis" pml:ref="//em" pml:content="." />
        <pml:structure pml:name="Citation" pml:ref="//q" pml:content="." />
        
        <pml:metadata 
            pml:name="Title" 
            pml:ref="/" 
            pml:content="//title/text()|concat(//h1/text()[1],//h1/em/text())" />
    </pml:dimensions>
    <head>
        <title>Extract from Alice's Adventures in Wonderland</title>
    </head>
    <body>
        <h1>
            Extract from <em>Alice's Adventures in Wonderland</em> 
        </h1>
        <p>
            Alice was beginning to get very tired of sitting by her sister 
            on the bank, and of having nothing to do: once or twice she had 
            peeped into the book her sister was reading, but it had no 
            pictures or conversations in it, <q>and what is the use of a 
            book,</q> thought Alice, <q>without pictures or conversation?</q>
        </p>
        <p>
            So she was considering, in her own mind (as well as she could,
            for the hot day made her feel very sleepy and stupid), whether
            the pleasure of making a daisy-chain would be worth the trouble
            of getting up and picking the daisies, when suddenly a White 
            Rabbit with pink eyes ran close by her.
        </p>
    </body>
</html>
Code 13 - An extract from Alice's Adventures in Wonderland with a complete pml set

As we can see in Code 13, the pml declaration referred to the metadata dimension has the value of its “ref” attribute set to “/”. As we know, this XPath refers to the document root. In PML, all XPath queries in “ref” that refer to the document root concern the document itself. Therefore this metadata declaration represents a piece of metadata for the whole document: in this particular case, its title.
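
The convention by which a “ref” of “/” attaches metadata to the document itself can be sketched as follows; the function and the dictionary shape are purely illustrative, not part of PML.

```python
import xml.etree.ElementTree as ET

# A namespace-free reduction of the document in Code 13.
DOC = """<html>
  <head><title>Extract from Alice's Adventures in Wonderland</title></head>
  <body><h1>Extract from <em>Alice's Adventures in Wonderland</em></h1></body>
</html>"""

root = ET.fromstring(DOC)

def apply_metadata(ref, name, values):
    """A declaration whose ref is the document root ('/') yields
    document-level metadata; any other ref would attach it to elements."""
    if ref == "/":
        return {("document", name): values}
    return {}

title = "".join(root.find("head/title").itertext())
meta = apply_metadata("/", "Title", [title])
print(meta[("document", "Title")][0])  # prints the document title
```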

The segmentation of the next document, presented in Code 14, is more complex than the previous one (Code 13).

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Dreaming Pentaformat - Home</title>
    </head>
    <body style="text-align:center;">
        <div class="header">
            <h1>
                <img src="pentagon.png" alt="A pentagon" title="Logo" />
                Home
            </h1>
            <p>
                All the descendant elements of the <q>div <i>class</i> header</q> element aren't content.
                They aren't written by S. but they are the result of an automatic process.
            </p>
        </div>
        <div class="content">
            <p>
                You are in the <a href="whatis.html" title="What is this?">Pentaformat
                Project <img src="small_pentagon.png" alt="A small pentagon" /></a> home page.
            </p>
        </div>
    </body>
</html>
Code 14 - An example written for the PML testing

Code 14, written for PML testing, can be segmented along all five dimensions, using the PML auxiliary items to perform a more accurate segmentation. We report an example of this improved segmentation in Code 15.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:pml="http://www.essepuntato.it/PML">
    <pml:dimensions>
        <pml:content 
            pml:name="Text" 
            pml:ref="//p[@pml:pid='p1']/descendant-or-self::element()" 
            pml:content="text()|@title|@alt" />
        <pml:content pml:name="Picture" pml:ref="//p[@pml:pid='p1']/img" pml:content="." />
        
        <pml:structure pml:name="Divider" pml:ref="//div" pml:content="." />
        <pml:structure pml:name="Paragraph" pml:ref="//p" pml:content="." />
        <pml:structure pml:name="Root" pml:ref="/html" pml:content="." />
        <pml:structure pml:name="Head" pml:ref="//head" pml:content="." />
        <pml:structure pml:name="Title" pml:ref="//title" pml:content="." />
        <pml:structure pml:name="Body" pml:ref="//body" pml:content="." />
        <pml:structure pml:name="Heading" pml:ref="//h1" pml:content="." />
        <pml:structure pml:name="Image" pml:ref="//img" pml:content="." />
        <pml:structure pml:name="Link" pml:ref="//a" pml:content="." />
        <pml:structure pml:name="Emphasis" pml:ref="//i" pml:content="." />
        <pml:structure pml:name="Citation" pml:ref="//q" pml:content="." />
        
        <pml:presentation pml:name="InnerCSS" pml:ref="//body" pml:content="@style" />
        <pml:presentation 
            pml:name="Header" 
            pml:ref="//div[@class = 'header']/(.|.//element())" 
            pml:content="text()|.|@title|@alt"/>
        <pml:presentation pml:name="Logo" pml:ref="//img[@title = 'Logo']" pml:content="."/>
        
        <pml:metadata pml:name="Title" pml:ref="/" pml:content="//head/title/text()|//h1/text()" />
        <pml:metadata pml:name="Title" pml:ref="//img|//a" pml:content="@title" />
        <pml:metadata pml:name="Source" pml:ref="//img" pml:content="@src" />
        <pml:metadata pml:name="Description" pml:ref="//img" pml:content="@alt" />
        <pml:metadata pml:name="Author" pml:ref="/" pml:content="//pml:stone[@pml:pid='s1']/text()" />
        
        <pml:behaviour pml:name="OpenLinkedDocument" pml:ref="//a" pml:content="@href" />
    </pml:dimensions>
    <head>
        <title>Dreaming Pentaformat - Home</title>
    </head>
    <body style="text-align:center;">
        <div class="header">
            <h1>
                <img src="pentagon.png" alt="A pentagon" title="Logo" /> 
                Home
            </h1>
            <p>
                All the descendant elements of the <q>div <i>class</i> header</q> element aren't content.
                They aren't written by <pml:stone pml:pid="s1">S.</pml:stone> but they are the result of 
                an automatic process.
            </p>
        </div>
        <div class="content">
            <p pml:pid="p1">
                You are in the <a href="whatis.html" title="What is this?">Pentaformat
                Project <img src="small_pentagon.png" alt="A small pentagon" /></a> home page.
            </p>
        </div>
    </body>
</html>
Code 15 - A segmentation for Code 14

In this example we have used both the “pid” attribute and the “stone” element: the former is used to refer to a specific “p” element in the first content declaration, in order to exclude the “p” child of the “div class header” element; the latter is used to identify the document author (as we can see in the last metadata declaration).

We think that these two examples clarify the use of PML to segment an XML document. The output of the new version of elISA [DVV04]* is a PML document obtained by applying a meta-XSLT [Kay07]* to an XML document. This latter document specifies some rules to identify the roles of the input document elements. In Section 3.3 we will discuss this new engine, introducing its new features.

3.3 The new Extraction of Layout Information via Structural Analysis

After the introduction of the Pentaformat model [Dii07]* in Section 3.1, which allows us to segment any document using five different but connected dimensions (content, structure, presentation, behaviour, metadata), and after the explanation of our language - the Pentaformat Markup Language or PML - used to segment XML documents, in this section we introduce elISA (Extraction of Layout Information via Structural Analysis) [DVV04]*, a rule-based engine that segments XML documents into two main dimensions: content and presentation. Our goal was to rewrite the engine in order to segment XML documents according to the Pentaformat model, producing a PML document as output.

First of all we want to introduce the main context, concerning the extraction of data (see Chapter 2), in which elISA works. Regarding the recognition of structured content, some studies ([Ven03]*, [Bag04]* and [DVV04]*) use elISA to extract all the content of a web page in order to realize a process of global editability for an ambitious project of the University of Bologna called ISAWiki [DV04]*, a client/server platform used to create, modify or reuse any web page. We will explain this framework, clarifying the role of elISA, in Section 3.3.1.

After this brief overview, we will introduce all the features of the new version (2.0) of elISA in order to clarify what kind of elaborations can be performed on XML documents. The details of these features will be explained in Section 3.3.2.

3.3.1 elISA: a rib of ISAWiki

To understand the context in which elISA [DVV04]* - the rule-based engine that segments XML documents according to content and presentation, presented in Section 2.4 - works, we must introduce the framework that uses it: ISAWiki [DV04]*. It is a client/server platform, inspired by Ted Nelson's Xanadu project [Nel80]*, in which every registered user can create, modify or reuse any web page through a client-side editor, the ISAWiki editor. All the pages created or modified through this editor are saved on an ISAWiki server in an intermediate language called ISAWiki Markup Language or IML [San06]*. This language is used to store the structured content of the document, leaving out the presentation. This process provides version control and local storage of all the new or modified documents, whereas all the original documents remain on their respective servers. In addition, once we have modified a web page we can ask ISAWiki to transform it into another document according to one of the following seven formats: HTML, XML, PDF, DOC, ODF, Wiki and LaTeX.

What a user can modify in a web page is the main issue the ISAWiki developers discussed. They chose to allow the editing of the content-related parts - such as the text of an article, the images concerning an article, and so on - denying any change to the presentation or the dynamic behaviour of a web page. They therefore needed to identify what, in a web page, is content and what is not. For this reason they developed an engine to perform this task: Extraction of Layout Information via Structural Analysis, or elISA. This engine is able to identify the content of a web page and some typical layout elements, such as logos, layout tables, advertising banners and so on. The elISA processing is based on a set of rules (specified by an XML document) that assigns the roles of the XML document elements through a structural analysis.

The main goal of this version of elISA is to identify the content of a web page. The engine includes three main components:

  • a set of rules (written in compliance with a specified DTD grammar) to identify the role of the document elements on the basis of their structure;
  • a meta-XSLT [Cla99]* that takes the rule-set document as input and returns a new XSLT to transform the original document into another one;
  • a client interface, written in Javascript, that allows users to run the engine and to see the result it produces.

The engine can produce two different results: the former is a new document where every specific part is coloured according to its role; the latter is a new IML document, i.e. a document in which we keep the structured content, leaving out all the other dimensions. We can see an example of this ISAWiki processing in Picture 13.

Picture 13 - An elISA analysis

In this picture we can see a web page from the CNN web site and the same page scanned by elISA. The cyan zones are text areas, the orange areas are layout cells and the pink zones are navigation zones.

This version of elISA requires a well-formed XML document to work. Today, as we know, most web pages are not XML. In fact, a lot of articles in online newspapers use a mixed language between HTML (which is not XML but SGML) and XHTML, obtaining ugly markup, as we can see in Code 16.

Picture 14 - ISAWiki process with the old version of elISA

<a
    href="javascript:pop_me_up2('http://www.nytimes.com/imagepages/2007/11/29/world/20071130_REFUGEES_GRAPHIC.html','439_983','width=439,height=983,location=no,scrollbars=yes,tool bars=no,resizable=yes')">
    <IMG 
        src="http://graphics8.nytimes.com/images/2007/11/29/world/20071130_REFUGEES_GRAPHIC19.jpg" 
        height="126" 
        width="190" 
        alt="A New, Sectarian Map" border="0">
    <span class="mediaType graphic">
        Graphic
    </span>
</a>
Code 16 - Non-well-formed markup in the article “Iraq Lacks Plan on the Return of Refugees, Military Says” (The New York Times)

The well-forming process is performed by an external tool in order to prepare a correct input for elISA. In addition, the use of the Pentaformat model in ISAWiki for document analysis is not achievable with this version of elISA, because it handles the content and (a part of) the presentation dimensions only. For this reason we have realized a new elISA engine (version 2.0), written in Java, in order to use all the capabilities of the Pentaformat model, extending the old version with a couple of new features. We will discuss them in Section 3.3.2.

3.3.2 New features

In order to allow the use of the Pentaformat model in the ISAWiki context [DV04]*, we have developed a new version (2.0) of elISA [DVV04]* that segments XML documents using PML, the language presented in Section 3.2. In this section we analyze the new features that characterize this new engine.

Like its ancestor, elISA 2.0 allows the user to specify a rule-set through an XML document that complies with a specific grammar, written in RelaxNG [Oas01]*. This grammar is quite similar to the old DTD grammar, except that we have added some new elements in order to help users in writing rules. We can see an example of a rule-set document in Code 17.

<?xml version="1.0" encoding="UTF-8"?>
<rules xmlns="http://www.essepuntato.it/Rules">
    <rule context="div">
        <call name="ancestor.class" select="ancestor-or-self::div[exists(@class)]" />
        <check>
            <whenever test="empty(text()[normalize-space() != '']) or exists(.//div)">
                <setStructure name="Divider" ref="." content="." weight="1.0"/>
            </whenever>
            <otherwise>
                <setStructure name="Paragraph" ref="." content="." weight="1.0"/>
                <setContent name="Text" ref="." content="text()[normalize-space() != '']" weight="0.3"/>
            </otherwise>
        </check>
        
        <check>
            <whenever>
                <test>
                    <containsOnly contextString="for $el in $ancestor.class return $el/@class">
                        <value>post</value>
                        <value>body</value>
                        <value>content</value>
                    </containsOnly>
                </test>
                <setContent name="Text" ref="." content="text()[normalize-space() != '']" weight="0.5"/>
            </whenever>
        </check>
        
        <check>
            <whenever test="exists(@style)">
                <setPresentation name="InnerCSS" ref="." content="@style" weight="1.0" />
            </whenever>
            <whenever test="exists(@onload)">
                <setBehaviour name="OnLoad" ref="." content="@onload" weight="1.0" />
            </whenever>
            <whenever test="exists(@title)">
                <setMetadata name="Description" ref="." content="@title" weight="1.0" />
            </whenever>
        </check>
    </rule>
</rules>
Code 17 - A (very small) rule-set document for the “div” element only

In this little document we have defined a simple (and a bit naive) rule to handle all the “div” elements of an (X)HTML document, using the “rule” element and specifying as “context” an XPath 2.0 query that refers to all these elements. Inside this “rule” element we have made a variable declaration - the “call” element named “ancestor.class”. It collects, through an XPath 2.0 query, the current element and all its “div” ancestors with a “class” attribute specified. Moreover, we have specified some conditional checkpoints. Every checkpoint has one or more if statements, called “whenever”, that allow us to specify (through “setContent”, “setStructure” and so on) one or more pml declarations for all the elements referred to in “ref” and “content” (where the “ref” sequence represents the current context for the “content” sequence, as usual). These are not classic pml declarations, however, but probabilistic pml declarations: through the “weight” attribute we associate a value, from 0 to 1, with each pml declaration. Two probabilistic pml declarations are similar if they have the same dimension and “name” and they refer to the same “ref” and “content” elements. Two similar probabilistic pml declarations can be rewritten as a single probabilistic pml declaration whose “weight” attribute is equal to the sum of the “weight” attributes of the two previous declarations. To understand this point, imagine we have three different “div” elements in an (X)HTML document:

  • one has some text inside;
  • one does not contain text but it has a “class” attribute with the value “post”;
  • one has both the previous features.

If we take into consideration the rule specified in Code 17, the weight values of the probabilistic pml declarations (dimension = content, name = “Text”) referred to these three “div” elements are:

  • 0.3 for the “div” with text;
  • 0.5 for the “div” without text and with the “class” specified;
  • 0.8 for the “div” with text and with the “class” specified.
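
The rewriting of similar probabilistic pml declarations can be sketched as follows; the string keys are illustrative stand-ins for the (dimension, name, ref, content) quadruple of each declaration.

```python
from collections import defaultdict

# Probabilistic pml declarations produced by the rule in Code 17 for three
# hypothetical "div" elements; keys stand for (dimension, name, ref, content).
decls = [
    (("content", "Text", "div-with-text", "text()"), 0.3),
    (("content", "Text", "div-with-class", "text()"), 0.5),
    (("content", "Text", "div-with-both", "text()"), 0.3),
    (("content", "Text", "div-with-both", "text()"), 0.5),
]

# Two similar declarations (same key) are rewritten as one declaration whose
# weight is the sum of the original weights.
merged = defaultdict(float)
for key, weight in decls:
    merged[key] += weight

print(round(merged[("content", "Text", "div-with-both", "text()")], 2))  # 0.8
```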

After we have specified all the probabilistic pml declarations for all the elements of a document, we can make a choice among these declarations by evaluating their weight values. In order to make these choices we have created a new simple language, written in RelaxNG, to define thresholds. As we can see in Code 18, we have defined a simple threshold for the same “div” elements used in Code 17. This threshold, referred to the content dimension, takes into consideration only the pml declarations referred to “div” elements with a weight greater than or equal to 0.75. We give a full explanation of the syntax and the semantics of these two languages in Section 4.1.

<?xml version="1.0" encoding="UTF-8"?>
<thresholds xmlns="http://www.essepuntato.it/Thresholds">
    <threshold context="div">
        <content>
            <select>
                <weight ge="0.75"/>
            </select>
        </content>
        <structure>
            <bestWeight priority="Divider Paragraph"/>
        </structure>
        <presentation>
            <select>
                <weight gt="0.8"/>
            </select>
        </presentation>
        <metadata>
            <select>
                <weight gt="0.6"/>
                <name value="Description"/>
            </select>
        </metadata>
        <behaviour>
            <select>
                <weight ge="0.7"/>
            </select>
        </behaviour>
    </threshold>
</thresholds>
Code 18 - A possible thresholds document according to Code 17
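
A minimal sketch of how such a threshold might filter probabilistic declarations follows; the in-memory structure and the function are assumptions made for illustration, not the actual elISA 2.0 implementation.

```python
# Hypothetical in-memory form of some probabilistic pml declarations.
decls = [
    {"dim": "content",  "name": "Text",        "weight": 0.8},
    {"dim": "content",  "name": "Text",        "weight": 0.5},
    {"dim": "metadata", "name": "Description", "weight": 0.7},
    {"dim": "metadata", "name": "Author",      "weight": 0.9},
]

def select(decls, dim, min_weight, strict=False, name=None):
    """Keep the declarations of one dimension whose weight passes the
    threshold ('ge' when strict is False, 'gt' when True) and, optionally,
    whose name matches."""
    kept = []
    for d in decls:
        if d["dim"] != dim:
            continue
        if (d["weight"] <= min_weight) if strict else (d["weight"] < min_weight):
            continue
        if name is not None and d["name"] != name:
            continue
        kept.append(d)
    return kept

content = select(decls, "content", 0.75)          # weight ge 0.75, as in Code 18
metadata = select(decls, "metadata", 0.6,
                  strict=True, name="Description")  # weight gt 0.6 and name match
print(len(content), len(metadata))  # 1 1
```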

So far we have described the features added in elISA 2.0 - such as pml declaration handling and the thresholds - in comparison to the old version. Moreover, we can now control two other aspects of the whole process that were handled externally in the previous version of the engine: a well-former and a plugin loader. The former well-forms a non-well-formed document so that it can be used in the elISA processing. The latter is able to load plugins, written according to a specific Java interface, that add information (new elements, new attributes and so on) to the input document. For example, if we want to add information about the dimensions of all the pictures in a document, we can write a plugin that:

  • gets the source file of each “img” element;
  • works out the width and the height of each picture;
  • inserts this information as qualified attributes of the respective “img” element.
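
A minimal sketch of such a plugin's effect, assuming an in-memory stand-in for reading each picture's size (the real plugin would follow the elISA 2.0 Java interface; everything here is illustrative):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for reading a picture's width and height from its
# source file; a real plugin would open the file and decode its header.
FAKE_SIZES = {"pentagon.png": (64, 64)}

def size_of(src):
    return FAKE_SIZES.get(src, (0, 0))

def annotate_images(root, ns="http://www.essepuntato.it/PML"):
    """Insert the width and height of every picture as qualified
    attributes of the respective 'img' element."""
    for img in root.iter("img"):
        w, h = size_of(img.get("src"))
        img.set("{%s}width" % ns, str(w))
        img.set("{%s}height" % ns, str(h))

doc = ET.fromstring('<p>A picture: <img src="pentagon.png"/></p>')
annotate_images(doc)
print(doc.find("img").get("{http://www.essepuntato.it/PML}width"))  # prints "64"
```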

We will discuss in depth how to make a plugin for elISA 2.0 in Section 4.1.

As we can see in Picture 15, the result of this three-step process is a PML document containing all the pml declarations selected in the thresholds step. This is the most important difference between the old version of elISA and elISA 2.0: while the former returns an IML document, the new version returns a document with more information than an IML document. The problem is that PML is not supported by the current version of ISAWiki: in order to use elISA 2.0 in the ISAWiki platform we must convert PML documents into IML documents.

In this section we have discussed all the new elISA 2.0 features and we have introduced the complications of using this new engine in ISAWiki (the platform presented in Section 3.3.1). The conversion from PML into IML seems easy, but it hides a trap. In fact, every IML document has another fundamental feature: it complies with seven structural patterns to organize the content. Obviously PML does not comply with this structural model, because it does not force a hierarchical structuring of the content. For this reason we need a patterning engine that converts a PML document into a new pattern-compliant document with the same pml declarations as the original. We will explain all these matters in depth in Section 3.4.

Picture 15 - The three steps of elISA analysis

3.4 From PML to IML

As we have seen in Section 3.3, elISA 2.0 generates a PML document as the result of its analysis. Unfortunately, for a complete integration in the ISAWiki framework [DV04]* we need to convert this kind of output into an IML document [San06]*, because the latter is the language in which ISAWiki stores its documents. This conversion is not easy because IML - but not PML - has another important feature: it structures content in compliance with seven structural patterns [Dii07]*. In reality there is another solution that would make the use of PML in ISAWiki possible: to change the whole structure of ISAWiki so that it handles PML as the intermediary language in which it saves all documents. We do not take this latter proposal into consideration for one main reason: changing ISAWiki to include PML would be too complex because of the large size of the platform. We have therefore chosen to develop an automatic mechanism to convert a PML document into an IML document.

In this section we will introduce the issue of structuring the content of XML documents according to some structural patterns and we will explain the benefits of this approach (Section 3.4.1). Then we will introduce, through some examples, the seven patterns that can be used to structure the content of an XML document (Section 3.4.2). After having understood what content models characterize these patterns and having recalled (Section 3.4.3) that the patterned structure is the main difference between IML and PML, we will propose four operations to pattern XML elements (Section 3.4.4). In Section 3.4.5 we will introduce a new language, and the related engine, to pattern an XML document while preserving the old unpatterned structure. We have called this language PML patterns.

3.4.1 The issue of structured content

The “PML to IML” conversion seems easy on the surface: we must transform a document with five dimensions into a document with two dimensions only. This is true but, besides the two dimensions, IML has a further feature: all its structures comply with a specific structural pattern.

As we know, the Pentaformat model (and PML too) does not force a hierarchy among the dimensions or among the elements belonging to a particular dimension. For example, a particular structure for the content is specified by the document authors. [Dii07]* suggests a solution to avoid problems in structuring the content of XML documents: a model used to express and normalize the structured content of any document. Moreover, this model, based on seven structural patterns, is useful to capture two of the five Pentaformat dimensions: the content and the structure. The reason for using a pattern approach during content structuring is justified by well-known scientific literature: when we find a group of similar problems, it is useful to find a common approach to solve them. This common solution is called a pattern. This approach was suggested by the architect Christopher Alexander in [Ale79]* and has been reused in the Computer Science field by Erich Gamma et al. [GHJ95]*.

There are some positive aspects in pattern use, among which:

  • the possibility of reusing a particular solution in different contexts or projects;
  • the ability to handle the structure of a document easily, providing a clear organization of all its elements;
  • the ability to make complex structures easy and understandable by composing different simple patterns.

On this basis, in Section 3.4.2 we will introduce the seven structural patterns used in IML to structure the content. Understanding how IML uses these patterns is fundamental to developing an engine that converts a PML document into a new patterned document with the same pml declarations. Transforming this new patterned document into an IML document is then really easy, and it allows a PML document to be used in ISAWiki.

3.4.2 Seven patterns

As we have introduced in Section 3.4.1, IML uses some structural patterns to structure content. This is the most important difference between IML and PML. Our goal is to convert PML documents into IML documents in order to use elISA 2.0 (Section 3.3) in ISAWiki [DV04]*. Before realizing this conversion we must understand what these patterns are and what kind of content models they have. We try to clarify these issues in this section.

In [Dii07]*, [DDD07]* and [Gub04]* the authors suggest some patterns to structure any XML document through the definitions of the respective content models. In this section we introduce a clear distinction between these seven patterns. All the examples in the following descriptions refer to the (X)HTML grammar and, as usual, we consider each element to be associated with one pattern only.

The first pattern which we introduce is the marker, i.e. an empty element that can have zero or more attributes. This pattern is split into two subpatterns according to the context:

  • we call milestone any marker whose position in the document is its most relevant feature. All the attributes associated with this kind of element are metadata of the element itself. E.g., the element “img” (with its attribute “src”) is a perfect example of this subpattern;
  • we call meta any marker whose existence is important while its position is not relevant. All these elements carry the same meaning independently of the position that they have in the document. A good example of an element that complies with this subpattern is “meta”.

We can see a use of these two subpatterns in Code 19.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="author" content="Silvio Peroni"/>
        <meta name="description" content="An example in which we introduce the use of markers"/>
    </head>
    <body>
        <p>
            In this paragraph we insert a picture like
            <img src="http://www.essepuntato.it/point.png" alt="A picture"/>
            to exemplify the use of <em>milestone</em> markers.
        </p>
    </body>
</html>
Code 19 - Using markers in a (X)HTML document

An atom is the pattern for all the elements that can contain text only. Two elements that use this pattern are “title” and “script”, as we can see in Code 20.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>What is an atom?</title>
    </head>
    <body>
        <p>
            This document introduces two examples of <em>atoms</em>. One is the element <q>title</q>, the other 
            one is the element <q><script type="text/javascript">document.write("script")</script></q>.
        </p>
    </body>
</html>
Code 20 - Using atoms in a (X)HTML document

The next two patterns, inline and block, have the same content model but they differ in one respect: the former can contain itself, while the latter cannot. Generally they contain elements that comply with the patterns milestone marker, atom or inline (all repeatable), and they can also contain text. As we can see in Code 21, there are many elements that comply with these two patterns, for example “p” and “h1” for block and “em”, “i” and “b” for inline.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Inlines and blocks: an old story</title>
    </head>
    <body>
        <h1>An <em>old</em> story</h1>
        <p>
            In this example we introduce two different <i>patterns</i>, <b>inline</b> and <b>block</b>,
            and some respective elements.
        </p>
    </body>
</html>

Code 21 - Using inlines and blocks in a (X)HTML document

The last three patterns - container, table, record - concern the organization of the content only. They do not contain text, but only elements that comply with the following patterns: meta marker, atom, block, container, table and record. The difference among these three patterns is how they handle element repeatability:

  • the pattern container contains optional or repeatable elements. For example, a “div” (used as an element without text) can contain zero or more paragraphs, zero or more lists and so on;
  • the pattern table contains homogeneous and repeatable elements. Good examples of elements that comply with this pattern are “ul” and “ol”;
  • the pattern record contains optional and non-repeatable elements.

A full example that introduces all these patterns is in Code 22.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>About the structure of a document</title>
    </head>
    <body>
        <p>
            In this example we introduce the three patterns used to structure a document:
        </p>
        <ul>
            <li><p>container;</p></li>
            <li><p>table;</p></li>
            <li><p>record.</p></li>
        </ul>
        <p>
            Are they enough?
        </p>
    </body>
</html>

Code 22 - Using containers, tables and records in a (X)HTML document

Besides its content model, which we can see summarized in Table 1, each of these seven patterns has an associated characterization called behaviour, which modifies, where possible, the content model of an element. There are three possible behaviours:

  • the standard behaviour, which does not modify the original content model;
  • the additive context behaviour, which allows adding one or more elements to the content model of the element that complies with this behaviour and of its descendants;
  • the subtractive context behaviour, which allows removing one or more elements from the content model of the element that complies with this behaviour and of its descendants.
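As an illustration of a subtractive context, consider the well-known (X)HTML rule that an “a” (hyperlink) element may not contain another “a” among its descendants: the element subtracts itself from its own content model and from those of its descendants. The following Python sketch (our own illustration, not part of any engine described in this thesis) checks a parsed tree for such a violation:

```python
import xml.etree.ElementTree as ET

def violates_subtractive(root, tag):
    """Return True if some element named `tag` contains another `tag`
    among its descendants (a subtractive-context violation)."""
    for elem in root.iter(tag):
        # elem.iter(tag) yields elem itself first, so a count above 1
        # means a nested occurrence exists somewhere below it
        if sum(1 for _ in elem.iter(tag)) > 1:
            return True
    return False

nested = ET.fromstring('<p><a href="x">outer <a href="y">inner</a></a></p>')
flat = ET.fromstring('<p><a href="x">one</a> and <a href="y">two</a></p>')
print(violates_subtractive(nested, 'a'), violates_subtractive(flat, 'a'))  # True False
```

Note that nested anchors are well-formed XML, which is why the parser accepts the first fragment: the violation concerns the pattern model, not well-formedness.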

There are two main benefits in using these seven patterns to structure a document:

  • with these seven patterns we can produce a document with a clear structure, knowing at any time the real role of each element;
  • if we know that a document is built with these patterns, then we can deduce, with the algorithm described in [DDD07]*, the pattern associated with each element.

           | EMPTY | text | milestones | meta | atom | inline | block | container | table | record
milestones |   X   |      |            |      |      |        |       |           |       |
meta       |   X   |      |            |      |      |        |       |           |       |
atom       |       |  X   |            |      |      |        |       |           |       |
inline     |       |  X   |     X      |      |  X   |   X    |       |           |       |
block      |       |  X   |     X      |      |  X   |   X    |       |           |       |
container  |       |      |            |  X   |  X   |        |   X   |     X     |   X   |   X
table      |       |      |            |  X   |  X   |        |   X   |     X     |   X   |   X
record     |       |      |            |  X   |  X   |        |   X   |     X     |   X   |   X

Table 1 - Summarizing table for the content models of all patterns
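The content models of Table 1 can be transcribed directly into a small lookup structure. The sketch below (ours, for illustration only) encodes the table rows and answers containment questions; “text” is treated as a pseudo-child alongside the pattern names:

```python
# Allowed children for each pattern, transcribed from Table 1.
# Note that "block" does not appear in its own row, while "inline" does:
# this is exactly the self-containment difference discussed in the text.
CONTENT_MODELS = {
    "milestone": set(),  # EMPTY
    "meta":      set(),  # EMPTY
    "atom":      {"text"},
    "inline":    {"text", "milestone", "atom", "inline"},
    "block":     {"text", "milestone", "atom", "inline"},
    "container": {"meta", "atom", "block", "container", "table", "record"},
    "table":     {"meta", "atom", "block", "container", "table", "record"},
    "record":    {"meta", "atom", "block", "container", "table", "record"},
}

def can_contain(parent, child):
    """True if pattern `parent` may contain `child` ("text" or a pattern name)."""
    return child in CONTENT_MODELS[parent]

print(can_contain("block", "inline"), can_contain("block", "block"))  # True False
```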

In this section we have introduced the seven patterns that IML uses to structure the content. In Section 3.4.3 we will explain how this feature is the most relevant difference between IML and PML.

3.4.3 PML and IML: what is the difference?

Besides the issue about dimensions, the main difference between IML and PML is that the former is strongly patterned, while in the latter a patterned structure is completely optional. Moreover, a lot of web documents, such as well-formed (X)HTML documents, are not patterned. We can take into consideration a common container element, such as the table cell “td”. As we know, a container does not contain text. For this reason, none of the “td” elements in Code 23 is patterned. Generally speaking, tables in (X)HTML documents are almost never patterned.

<p>
    The following illustrates a simple table with three columns and six rows. 
    The first row is not counted, because it is only used to display the column 
    names. This is traditionally called a "header row".
</p>
<p>
    <b>Age table:</b>
</p>
<table class="wikitable" border="5">
    <tbody>
        <tr>
            <th>first</th>
            <th>last</th>
            <th>age</th>
        </tr>
        <tr>
            <td>Nancy</td>
            <td>Davolio</td>
            <td>33</td>
        </tr>
        <tr>
            <td>Nancy</td>
            <td>Klondike</td>
            <td>43</td>
        </tr>
        <tr>
            <td>Nancy</td>
            <td>Obesanjo</td>
            <td>23</td>
        </tr>
        <tr>
            <td>Justin</td>
            <td>Saunders</td>
            <td>37</td>
        </tr>
        <tr>
            <td>Justin</td>
            <td>Timberlake</td>
            <td>26</td>
        </tr>
        <tr>
            <td>Amy</td>
            <td>Mes</td>
            <td>11</td>
        </tr>
    </tbody>
</table>
Code 23 - An extract from the article “Table (information)” of Wikipedia

The goal is to convert a PML document into an IML document in order to use elISA 2.0 within ISAWiki. To allow this conversion we have developed an engine that can pattern any XML document on the basis of some rules, preserving all the pml declarations. Before introducing this new engine, we need to illustrate what kinds of operations we can use to pattern XML documents. We will discuss this issue in Section 3.4.4.

3.4.4 Patterning process

We have explained that, to use elISA 2.0 within ISAWiki, we need to convert PML documents into IML documents. This operation is not easy because IML - but not PML - structures content according to seven structural patterns, as we have introduced in Section 3.4.2. The point is to pattern PML documents through a specific engine which we have developed. To understand how we can specify patterning rules for the engine, we need to explain what kinds of operations we can use to pattern the elements of any XML document. In this section we answer this question.

After an analysis based on both simple and complicated examples, we have identified these two main operations:

  • the wrap operation on an element inserts one or more elements as its children in order to pattern its structure;
  • the unwrap operation on an element removes it from the document, splicing its content into its former position.

These two operations represent the minimum set of operations needed to pattern any XML document. Now, to understand what these operations can do, we make some examples. As we can see in Code 24 (a little example of a non-patterned document in which “div” is a container and “i” is an inline), the “div” element does not comply with its pattern because it contains text and an inline element too.

<?xml version="1.0" encoding="UTF-8"?>
<div>
    This is a little example to understand the <i>wrap</i> operation.
</div>
Code 24 - A little example of a not patterned document

To pattern the document we apply a wrap on the “div” with a block element like “p” in order to enclose all its children. Through this simple operation we transform the original document in Code 24 into a patterned document, as we can see in Code 25.

<?xml version="1.0" encoding="UTF-8"?>
<div>
    <p>
        This is a little example to understand the <i>wrap</i> operation.
    </p>
</div>
Code 25 - A patterned version of the document in Code 24
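The wrap just applied can also be sketched programmatically. The fragment below (a Python illustration of ours using the standard library, not the thesis engine's actual Java code) moves all the text and children of an element into a new wrapper child:

```python
import xml.etree.ElementTree as ET

def wrap_children(parent, tag):
    """Wrap all text and child elements of `parent` into a new child `tag`."""
    wrapper = ET.Element(tag)
    wrapper.text = parent.text     # the leading text moves into the wrapper
    parent.text = None
    for child in list(parent):     # children carry their tail text with them
        parent.remove(child)
        wrapper.append(child)
    parent.append(wrapper)
    return wrapper

div = ET.fromstring('<div>This is a little example to understand the <i>wrap</i> operation.</div>')
wrap_children(div, 'p')
print(ET.tostring(div, encoding='unicode'))
# <div><p>This is a little example to understand the <i>wrap</i> operation.</p></div>
```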

We can also apply specific and multiple wrap operations. In the document presented in Code 26 there is a fake unordered list, because there is text in an incorrect position. We want to pattern this text node in order to obtain a correct list, and we must apply two different wrap operations to obtain that.

<?xml version="1.0" encoding="UTF-8"?>
<ul>
    <li><p>Correct nesting;</p></li>
    Text without a correct nesting;
    <li><p>Another correct nesting.</p></li>
</ul>
Code 26 - Another not patterned document

Even in this case the solution is simple. We apply a wrap on the “ul” text children with a container element “li”. Then, on this result, we apply another wrap on all the children of the new “li” (in this case, text nodes only) with a block pattern such as “p”, obtaining a perfectly patterned document, as we can see in Code 27.

<?xml version="1.0" encoding="UTF-8"?>
<ul>
    <li><p>Correct nesting;</p></li>
    <li><p>Text without a correct nesting;</p></li>
    <li><p>Another correct nesting.</p></li>
</ul>
Code 27 - A patterned version of the document in Code 26

To illustrate how the unwrap operation works we can take into consideration the document in Code 28. In this example we can see an incorrect nesting of two paragraphs. This situation is not possible in a patterned document: a block cannot contain another block.

<?xml version="1.0" encoding="UTF-8"?>
<p>
    <p>
        Too many paragraphs...
    </p>
</p>
Code 28 - A not patterned document with too many paragraphs

The solution for this example is easy to find: we apply an unwrap operation on the outer paragraph, i.e. the document element, obtaining a perfectly patterned document, as we can see in Code 29.

<?xml version="1.0" encoding="UTF-8"?>
<p>
    Too many paragraphs...
</p>
Code 29 - A patterned version of the document in Code 28
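Symmetrically, the unwrap can be sketched as splicing an element's content into its parent. The fragment below is again an illustration of ours, not the engine's code; most of its length deals with ElementTree's text/tail model:

```python
import xml.etree.ElementTree as ET

def unwrap(parent, child):
    """Remove `child` from `parent`, splicing its text and children
    into the position it occupied."""
    children = list(parent)
    idx = children.index(child)

    def attach_before(s):
        # append text at the point just before where `child` sat
        if not s:
            return
        if idx == 0:
            parent.text = (parent.text or '') + s
        else:
            prev = children[idx - 1]
            prev.tail = (prev.tail or '') + s

    grandchildren = list(child)
    if grandchildren:
        attach_before(child.text)
        last = grandchildren[-1]
        last.tail = (last.tail or '') + (child.tail or '')
    else:
        attach_before((child.text or '') + (child.tail or ''))

    parent.remove(child)
    for i, g in enumerate(grandchildren):
        parent.insert(idx + i, g)

root = ET.fromstring('<p><p>Too many paragraphs...</p></p>')
unwrap(root, root[0])
print(ET.tostring(root, encoding='unicode'))  # <p>Too many paragraphs...</p>
```

The text describes unwrapping the outer “p” of Code 28; removing the inner one, as here, yields the same document as Code 29, without having to promote a new document element.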

How can we solve a situation such as the one in Code 30? In this case we must remove the outer “p” and add a list element, through multiple operations on the same element, in order to pattern the document.

<?xml version="1.0" encoding="UTF-8"?>
<p>
    <li><p>A list item;</p></li>
    <li><p>Another list item;</p></li>
    <li><p>But where is the list?</p></li>
</p>
Code 30 - Where is the list?

So we apply an unwrap operation on the outer “p” (the document element) and a wrap operation on all its children with a list element (for example “ul”). As we can see in Code 31, through this multiple operation we obtain a perfectly patterned document.

<?xml version="1.0" encoding="UTF-8"?>
<ul>
    <li><p>A list item;</p></li>
    <li><p>Another list item;</p></li>
    <li><p>But where is the list?</p></li>
</ul>
Code 31 - A patterned version of the document in Code 30

With these two operations (wrap and unwrap) we can pattern any XML document. Nevertheless, there are some contexts in which combining these two different operations is not convenient. As we can see in Code 30, instead of using the multiple operation, we could use a rename operation to change the element “p” directly, without any explicit wrap or unwrap operations. Another example of this kind of trouble is introduced in Code 32. In this case we want to pattern the document by changing the position of the element “b” in order to place it in a correct location.

<?xml version="1.0" encoding="UTF-8"?>
<div>
    <b>
        <p>
            Text with a <i>bold</i> style.
        </p>
        <div>
            <p>
                Another text with a <i>bold</i> style.
            </p>
        </div>
    </b>
</div>
Code 32 - A not patterned document with some text in bold

A possible solution is to apply an unwrap operation on “b” and a wrap operation on each “p” with a new element “b”. However, this is somewhat complicated: it would be easier to use a specific operation applied only to the element “b”, obtaining the same result, illustrated in Code 33.

<?xml version="1.0" encoding="UTF-8"?>
<div>
    <p>
        <b>Text with a <i>bold</i> style.</b>
    </p>
    <div>
        <p>
            <b>Another text with a <i>bold</i> style.</b>
        </p>
    </div>
</div>
Code 33 - A patterned version of the document in Code 32

To solve these situations we introduce two new operations, obtained by composing the wrap and the unwrap:

  • rename an element in order to change its pattern or its name. This operation is obtained by applying an unwrap and a wrap on the element that we want to rename;
  • inject an element in order to remove the element and re-inject it in all descendant elements that can accept it as a child according to their content models.

With the inject alone we obtain the effect of the multiple wrap and unwrap operations that we have already seen.
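In a DOM-like setting, rename reduces to retagging the element in place, which is the unwrap-plus-wrap composition collapsed into one step. A minimal sketch of ours, applied to the problem of Code 30:

```python
import xml.etree.ElementTree as ET

def rename(elem, new_tag):
    """Change the element's name (and hence its pattern); attributes,
    text and children are untouched, exactly as if we had unwrapped the
    old element and wrapped its content in the new one."""
    elem.tag = new_tag
    return elem

root = ET.fromstring('<p><li>A list item;</li><li>Another list item;</li></p>')
rename(root, 'ul')
print(ET.tostring(root, encoding='unicode'))
# <ul><li>A list item;</li><li>Another list item;</li></ul>
```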

An engine based on these four operations must have two fundamental features: it must preserve the old unpatterned structure of the document, and it must preserve all the pml declarations even though the document structure changes because of some patterning operations. To support the former feature we have developed another language, called PML patterns (or PMLp): we will illustrate it in Section 3.4.5, introducing our engine to pattern XML documents. Through this engine we can transform a PML document into a patterned PML+PMLp document, and then we can convert the latter into an IML document, in order to use elISA 2.0 within ISAWiki.

3.4.5 PML patterns (PMLp)

In this section we introduce the language that we use to pattern XML documents and the patterning engine that we have developed. Both are used to convert a PML document into an IML document. The chained application of elISA 2.0, of the patterning engine and of a simple meta-XSLT allows the use of elISA 2.0 within the current version of the ISAWiki platform. We now see how the patterning engine returns a perfectly patterned XML document with all the pml declarations specified in the original document.

As discussed in Section 3.4.4, another problem for the patterning process is how to associate a pattern with each element of an XML document in order to pattern it. The answer in this case is PML. We can use this language to associate with each element of a document one of the seven patterns that we have seen in Section 3.4.2. For this reason we extend the current PML language with a new assumption: the value of the attribute “name” of any pml declaration concerning the structure dimension has seven basic values - “Pmarker”, “Patom”, “Pinline”, “Pblock”, “Pcontainer”, “Ptable”, “Precord”. Any other possible value (such as “Paragraph”, “Divider” and so on) has a subclass relation with exactly one of the seven basic values.
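A resolver for this subclass relation could look like the following sketch; the two sample mappings (“Paragraph” to “Pblock”, “Divider” to “Pcontainer”) are our own assumptions for illustration, not values fixed by the thesis:

```python
# The seven basic values fixed by the extended PML language.
BASIC = {"Pmarker", "Patom", "Pinline", "Pblock", "Pcontainer", "Ptable", "Precord"}

# Hypothetical subclass table: every non-basic name maps to exactly one basic value.
SUBCLASS_OF = {"Paragraph": "Pblock", "Divider": "Pcontainer"}

def basic_pattern(name):
    """Resolve a structure-dimension pml name to its basic pattern value."""
    return name if name in BASIC else SUBCLASS_OF[name]

print(basic_pattern("Paragraph"), basic_pattern("Pinline"))  # Pblock Pinline
```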

On the basis of this new PML version, we have developed a Java rule-based patterning engine that solves any patterning issue. The document patterning realized by this engine produces a new document based on a new language called Pentaformat Markup Language and patterns, or simply PMLp. Of course, the patterning process performed by the engine correctly preserves all the pml declarations of the original non-patterned document, even though some elements may have been added or removed during the patterning process. Now, in order to see how we can define the patterning rules and how the engine works, let us revisit the examples of Section 3.4.4.

To obtain a patterned document from the example in Code 24, we define a wrap rule for all the “div” elements, as we can see in Code 34. The patterning-rule document is an XML document in which we can define local or global variables with the element “variable”. Moreover, we can define rules with the “pattern” element. A rule matches the XPath 2.0 expression contained in the attribute “match”. The “pattern” element is composed of some if/else-if statements defined by a succession of “when” elements. As the child of a “when” we can specify only one of the four operations introduced in Section 3.4.4.

<?xml version="1.0" encoding="UTF-8"?>
<patterns xmlns="http://www.essepuntato.it/Patterns">
    <variable name="text.inline" select="text()[normalize-space() != '']|element()[f:isInline(.)]"/>
    
    <pattern match="div">
        <choose>
            <!-- Rule 1 -->
            <when test="exists(element()[f:isInline(.)])">
                <wraps>
                    <wrap pattern="Pinline" select="$text.inline" />
                </wraps>
            </when>
        </choose>
    </pattern>
    
    <pattern match="p">
        <choose>
            <!-- Rule 2 -->
            <when test="count(element()[f:isBlock(.)]) = 1 and empty(text()[normalize-space() != ''])">
                <unwrap />
            </when>
            <!-- Rule 3 -->
            <when test="count(li) = count(element())">
                <rename pattern="Ptable"/>
            </when>
        </choose>
    </pattern>
    
    <pattern match="ul">
        <choose>
            <!-- Rule 4 -->
            <when test="exists(text()[normalize-space() != '']|element()[f:isInline(.)])">
                <wraps>
                    <wrap pattern="Pcontainer" select="$text.inline">
                        <wrap pattern="Pblock" />
                    </wrap>
                </wraps>
            </when>
        </choose>
    </pattern>
    
    <pattern match="b">
        <choose>
            <!-- Rule 5 -->
            <when test="exists(element()[f:isBlock(.) or f:isContainer(.)])">
                <inject />
            </when>
        </choose>
    </pattern>
</patterns>
Code 34 - A patterning rule-set to solve all the examples in Section 3.4.4
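The choose/when mechanics of such a rule document can be mimicked in a few lines. In the sketch below (ours; plain Python predicates stand in for the XPath 2.0 tests of the real rule document, and the operation is just a label), the first rule whose test holds wins, mirroring the if/else-if succession of “when” elements:

```python
import xml.etree.ElementTree as ET

def first_matching(rules, elem):
    """Return the operation of the first rule whose test holds for `elem`."""
    for test, operation in rules:
        if test(elem):
            return operation
    return None

# A hypothetical rule for "div": if it has non-empty text or inline-like
# children, some wrap operation applies (cf. Rule 1 of Code 34).
div_rules = [
    (lambda e: (e.text or '').strip() != ''
               or any(c.tag in ('i', 'b', 'em') for c in e),
     'wrap'),
]

div = ET.fromstring('<div>Text with an <i>inline</i>.</div>')
print(first_matching(div_rules, div))  # wrap
```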

Through these few rules we pattern all the examples shown in Section 3.4.4. One of the engine's goals is to preserve, somehow, the original structure of the document. To achieve this goal, the engine uses qualified elements and attributes to record removed or modified elements. Now we consider the examples in Code 24 and Code 26, where we use a single and some multiple wrap operations. In these cases, as we can see in Code 35 and in Code 36, we identify the element that applies the wrap operation with an identifier specified by the qualified “pmlp:wrap” attribute. The “pmlp:wrapped” attribute refers to the element that has applied the wrap operation.

<?xml version="1.0" encoding="UTF-8"?>
<div pmlp:wrap="div1" xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
    <pml:dimensions>
        <pml:structure pml:name="Pcontainer" pml:ref="//div" pml:content="." />
        <pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
    </pml:dimensions>
    <p pmlp:wrapped="div1">
        This is a little example to understand the <i>wrap</i> operation.
    </p>
</div>
Code 35 - How the engine patterns the document in Code 24

<?xml version="1.0" encoding="UTF-8"?>
<ul pmlp:wrap="ul1" xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
    <pml:dimensions>
        <pml:structure pml:name="Ptable" pml:ref="//ul" pml:content="." />
        <pml:structure pml:name="Pcontainer" pml:ref="//li" pml:content="." />
        <pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
    </pml:dimensions>
    <li><p>Correct nesting;</p></li>
    <li pmlp:wrapped="ul1"><p pmlp:wrapped="ul1">Text without a correct nesting;</p></li>
    <li><p>Another correct nesting.</p></li>
</ul>
Code 36 - How the engine patterns the document in Code 26

An unwrapped element is removed and replaced by a qualified pmlp element called “old” with two obligatory qualified attributes: “pmlp:name”, which contains the prefixed name of the old element, and “pmlp:unwrap”, which represents the identifier of the operation. We can see an example of this operation in Code 37.

<?xml version="1.0" encoding="UTF-8"?>
<pmlp:old pmlp:name="p" pmlp:unwrap="p1" xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
    <pml:dimensions>
        <pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
    </pml:dimensions>
    <p>
        Too many paragraphs...
    </p>
</pmlp:old>
Code 37 - How the engine patterns the document in Code 28

The rename operation is an application of both the wrap and the unwrap operations. For this reason, as we can see in Code 38, the result of this operation is a trade-off between wrap and unwrap: the old element is replaced by a “pmlp:old” element with a qualified attribute “pmlp:rename” (an identifier), while the newly inserted element (the object of the renaming) has a qualified “pmlp:renamed” attribute referring to the old element.

<?xml version="1.0" encoding="UTF-8"?>
<pmlp:old pmlp:name="p" pmlp:rename="p1" xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
    <pml:dimensions>
        <pml:structure pml:name="Ptable" pml:ref="//ul" pml:content="." />
        <pml:structure pml:name="Pcontainer" pml:ref="//li" pml:content="." />
        <pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
    </pml:dimensions>
    <ul pmlp:renamed="p1">
        <li><p>A list item;</p></li>
        <li><p>Another list item;</p></li>
        <li><p>But where is the list?</p></li>
    </ul>
</pmlp:old>
Code 38 - How the engine patterns the document in Code 30

In the end, the inject has a result similar to the rename. In fact, as we can see in Code 39, the element that applies the operation is replaced by a “pmlp:old” element (in which the attribute “pmlp:inject” is an identifier) as usual. All the new elements created by this operation have a qualified “pmlp:injected” attribute that refers to their “creator”.

<?xml version="1.0" encoding="UTF-8"?>
<div xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
    <pml:dimensions>
        <pml:structure pml:name="Pinline" pml:ref="//(b|i)" pml:content="." />
        <pml:structure pml:name="Pcontainer" pml:ref="//div" pml:content="." />
        <pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
    </pml:dimensions>
    <pmlp:old pmlp:name="b" pmlp:inject="b1">
        <p>
            <b pmlp:injected="b1">Text with a <i>bold</i> style.</b>
        </p>
        <div>
            <p>
                <b pmlp:injected="b1">Another text with a <i>bold</i> style.</b>
            </p>
        </div>
    </pmlp:old>
</div>
Code 39 - How the engine patterns the document in Code 32

The last issue that we present concerns the difference between syntactic and semantic patterning.

Syntactic patterning patterns a document basing its operations on the document structure only. In this case there are many possible solutions to a patterning problem.

The semantic patterning issue is a little different. In this case the patterning works well if and only if the visualization of the patterned document and that of the original document are the same. We can think of “visualization” as the way a browser displays a web document, keeping in mind that two web browsers can display a web page in different manners. For this reason it is not easy to define what “same visualization” means. This is a hard problem to solve.

The first version of our engine is able to handle syntactic patterning (because it is based on clear rules) and semantic patterning, excluding unusual scenarios. We are now working on a new version of the engine, based on milestones overlapping markup [AMP03]*, to handle as many semantic patterning scenarios as possible.

In this section we have introduced the patterning engine that we have developed in order to pattern any XML document while preserving all its pml declarations. This transformation is made using PML together with the new PMLp language, which we use to pattern the document and to remember the old unpatterned structure. As we know, we can convert a PML+PMLp document into an IML document through a simple meta-XSLT. So, through the application of elISA 2.0, the patterning engine and this meta-XSLT, we can convert any web document into an IML document. This allows us to replace the old version of elISA in ISAWiki with our new version, in order to handle XML documents segmented according to the Pentaformat model [Dii07]*.

3.5 So what? (take 2)

In this chapter we have illustrated our main work related to web document segmentation in order to allow the extraction of data (the main technological context of this thesis). First of all, we have introduced a model for a five-dimensional segmentation of any document, called Pentaformat [Dii07]*. According to this model we have developed a language, the Pentaformat Markup Language (PML), that allows specifying declarations to segment XML documents. The use of this language is the main innovation of the new version (2.0) of elISA [DVV04]*: while the old version identifies the content and some presentational elements of a web document, elISA 2.0 segments it according to the Pentaformat model using PML. In order to replace the old engine version with elISA 2.0 in the ISAWiki platform [DV04]*, we have developed another rule-based engine that patterns PML documents according to seven structural patterns [Gub04]*. The output of this engine is a new XML document based on PML and PML patterns (PMLp), a language developed for the restructuring of XML documents according to some patterning operations. With this latter step we can obtain from a PML+PMLp document, through a meta-XSLT [Kay07]*, an IML document. This format is what ISAWiki uses to store documents.

In Chapter 4 we will re-introduce the two engines developed, elISA 2.0 and the patterning engine, in order to describe their implementation in much more detail. After that we will introduce a web application, developed on top of the two engines and the Java Servlet technology [Sun06a]*, called elISA Server Side. It provides a user interface to segment a specified web document using elISA 2.0 and, eventually, to transform the PML document (obtained in the segmentation phase) into an IML document.

4 Features hole and its monsters

In the last chapter, Chapter 3, we have introduced all the theories and the two tools that we have developed for this work in order to achieve the goal introduced in Chapter 1: to develop a rule-based mechanism to segment XML documents according to a five-dimensional model, called Pentaformat [Dii07]*, in order to convert them automatically into new documents using one or more of the constituents introduced by the model.

In this chapter we will deepen some concepts concerning elISA 2.0 and the patterning engine - introduced in Section 3.3 and Section 3.4 respectively - related to the specific architectures of these two engines and the specific Relax NG [Oas01]* grammars developed for the documents that the engines use to perform their processes. We will illustrate these components in Section 4.1 and in Section 4.2.

We can achieve the goal of our thesis using an application that combines these two engines. For this reason we have developed a web application called elISA Server Side that uses these engines and other meta-XSLT documents [Kay07]* to perform the transformation of a web document into an IML document via Pentaformat segmentation. This application is developed in Java 6.0 [Sun06b]* with Servlet technology [Sun06a]* and it has been tested on Tomcat 6.0.16. Through elISA Server Side we can specify the URL of a web document to be analyzed according to some choosable rules and thresholds in order to obtain a PML document. After this first analysis we can choose whether to download the PML document, to display it, or to transform it into an IML document. We will explain this process in detail in Section 4.3.

4.1 elISA engine: the infrastructure

The engine that we have introduced in Section 3.3, called elISA 2.0, is a piece of software that segments any web document according to the Pentaformat [Dii07]* - the model that we have illustrated in Section 3.1. This model allows extracting data by identifying the roles that the elements of a web document may have, according to the five dimensions: content, structure, presentation, behaviour and metadata. The output of elISA 2.0 is a PML document (Section 3.2), i.e. an XML document with some pml declarations. As we have illustrated in Section 3.3, this engine needs two fundamental files which specify, respectively, the rules for defining pml declarations and the thresholds for selecting a particular subset of them.

In this section we will introduce the infrastructure of elISA 2.0 in detail. In Section 4.1.1 we will describe which modules are part of it and how they work. Then, in Section 4.1.2, we will describe the structure of the rules document and of the thresholds document in order to understand how we can define them.

4.1.1 Three steps in five phases

Picture 15 describes elISA 2.0 as an engine working through three main steps. This is true, but two of these three steps are split into two phases each, for a total of five distinct phases, as we can see in Picture 16:

  1. in the establish phase (step 1) we try to make the input document well-formed through an external well-former, HTMLCleaner version 1.6 by Vladimir Nikic. It is possible to change the default well-former by modifying a configuration file;
  2. the load phase (step 1) allows loading and executing on the input document an unlimited number of plugins conforming to a known Java interface. Using these plugins we can add information to, or remove it from, the well-formed input document;
  3. through the indicate phase (step 2) the engine completes the first real analysis of the input document, producing an intermediate document written in a language called PML Qualifier. We use this language to specify all the probabilistic pml declarations deduced from a rule-set document;
  4. the next solve phase (step 3) rewrites all the probabilistic pml declarations of the PML Qualifier document. In this phase we sum the weights as illustrated in Section 3.3.2, solving all the XPath 2.0 queries of the declarations in order to associate these declarations with each element;
  5. in the last acknowledge phase (step 3) the engine chooses the probabilistic pml declarations that will appear in the final PML document.

During the first phase our goal is to obtain a well-formed web document, whether the input is well-formed or not. In the first case the establish phase returns the same XML document without changes; in the second case the document is re-formatted so that it becomes well-formed. This operation is performed by a specific plugin in a JAR archive [Sun03]* specified in the configuration file of the engine, as we can see in Code 40. Taking the elISA 2.0 path as context, “basepath” specifies where the engine can find the file “name” that contains the well-former. To create this plugin, the engine uses the class specified in the attribute “class”. This dynamic loading of JAR files (and of the next pieces of code) is performed by means of Reflection [Sun02]*, a package that allows a Java application to examine or modify its own runtime behaviour.
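The dynamic loading described above can be sketched in plain Java. The following is a minimal illustration of the Reflection mechanism only: instead of loading a class from an external JAR (which a real deployment would do through a URLClassLoader pointed at the “basepath”/“name” file), it instantiates a local stand-in class by name and invokes the configured method. All class and method names here are illustrative, not the actual elISA 2.0 internals.

```java
import java.lang.reflect.Method;

public class ReflectionSketch {

    // Stand-in for the well-former class named by the "class" attribute.
    public static class DummyWellFormer {
        public String run(String input) {
            return "<html>" + input + "</html>"; // placeholder for real well-forming
        }
    }

    // Load a class by name, create an instance, and invoke the configured
    // method on it - the same pattern used with the "class" and "method"
    // attributes of the configuration file. Returns null on any failure.
    public static String invokeByName(String className, String methodName,
                                      String arg) {
        try {
            Class<?> cls = Class.forName(className);
            Object instance = cls.getDeclaredConstructor().newInstance();
            Method method = cls.getMethod(methodName, String.class);
            return (String) method.invoke(instance, arg);
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(invokeByName(
                "ReflectionSketch$DummyWellFormer", "run", "some text"));
    }
}
```

The three values driving the invocation - class name, method name and argument - play the same role as the corresponding attributes of Code 40.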

Picture 16 - The five phases of elISA 2.0

<wellformer 
    basepath="wf"
    name="s-wellformer.jar"
    class="it.essepuntato.elisa.wellformer.HtmlCleanerWellFormer"
    method="run" />
Code 40 - The extract of the configuration file concerning the well-former

The method to invoke is specified in the attribute “method”. The goal of this method is to return an instance of “org.w3c.dom.Document” that represents the XML document obtained from the (not necessarily well-formed) input document. In order to use this external plugin correctly, the main class of the package must comply with the specific Java interface introduced in Code 41. As we can see, the class that implements “IElisaWellFormer” has to provide two similar methods that respectively take a string or the source file of the original document and return a well-formed XML document.

package it.essepuntato.elisa.wellformer;
import java.io.File;
import org.w3c.dom.Document;

public interface IElisaWellFormer {
    public Document run(String string);
    public Document run(File file);
}
Code 41 - The well-former Java interface of elISA 2.0

We have used an external plugin to specify the well-former so that, when we want to replace it with a new version or with a different well-former, we can do so simply by changing the configuration file.

The second phase of the engine process - load - works like the first one. We can specify some plugins by adding them to the configuration file through the tag “plugin”. All these plugins reside in the path specified by the attribute “basepath” of the tag “loader”. All the plugins specified in the configuration file (Code 42) are executed in ascending (document) order. The attributes “name”, “class” and “method” are used by the engine as in the first phase. The attribute “filespath” specifies the local directory - relative to the “basepath” of the “loader” - in which we put the files needed by the plugin. All the elements specified by the tags “param” represent the parameters of the plugin.

<loader basepath="plugin">
    <plugin 
        type="jar" 
        name="node-enumeration.jar" 
        class="it.essepuntato.elisa.plugin.NodeEnumeration" 
        method="run" 
        filespath="NodeEnumeration">
        <param><key>xslt</key><value>NodeEnumeration.xsl</value></param>
    </plugin>
</loader>
Code 42 - The extract of the configuration file concerning the plugin loader

The goal of every plugin is to take an XML document, process it and return a new XML document with (possibly) some changes. In order to work correctly, all plugins must comply with the specific Java interface that we can see in Code 43. The method “run”, invoked by the engine, takes three parameters: the XML document to be processed, the path in which the files needed by the plugin are located, and a “Map” specifying all its parameters.

package it.essepuntato.elisa.plugin;
import java.io.File;
import java.util.Map;
import org.w3c.dom.Document;

public interface IElisaPlugin {
   public Document run(Document dom, File filePath, Map<String,String> params);
}
Code 43 - The plugin Java interface of elISA 2.0

Using this loading phase to execute plugins, we can remove data from or add data to the input XML document in order to improve the quality of the identification of roles. We can add information to some elements temporarily - to be used during the process only - through qualified attributes: their namespace must be “http://www.essepuntato.it/PMLLoad”, and they must be specified with the prefix “load”. We use this namespace and prefix so that the added data can be removed from the final PML document.
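As a concrete illustration, here is a hypothetical plugin that adds one such temporary attribute. The interface is the one of Code 43 (repeated here, without its original package, so that the sketch is self-contained); the attribute name “load:visited” is invented for the example.

```java
import java.io.File;
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Same shape as the IElisaPlugin interface of Code 43.
interface IElisaPlugin {
    Document run(Document dom, File filePath, Map<String, String> params);
}

public class MarkRootPlugin implements IElisaPlugin {
    public static final String LOAD_NS = "http://www.essepuntato.it/PMLLoad";

    public Document run(Document dom, File filePath, Map<String, String> params) {
        // Temporary datum in the PMLLoad namespace: usable by later phases,
        // stripped from the final PML document.
        dom.getDocumentElement().setAttributeNS(LOAD_NS, "load:visited", "true");
        return dom;
    }

    // No-throw helper that runs the plugin on a tiny document and reports
    // the value of the attribute it added.
    public static String demo() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document dom = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<html><body/></html>")));
            Document out = new MarkRootPlugin().run(dom, null, null);
            return out.getDocumentElement().getAttributeNS(LOAD_NS, "visited");
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```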

The next three phases - indicate, solve and acknowledge - are implemented with three different meta-XSLTs:

  • in the third phase we apply a meta-XSLT to a document containing some rules in order to create a new XSLT that indicates the probable roles of the elements of the input XML document. The output is a document written in an intermediary PML language called PML Qualifier. Using this language we can attach probabilistic pml declarations to all the elements of the document;
  • in the fourth phase we apply the input document to a meta-XSLT and then to the resulting XSLT. Through the first application we create an XSLT on the basis of the XPath expressions of all probabilistic pml declarations. Through the second application we solve the XPath of all probabilistic pml declarations and sum their “weight” values among identical declarations;
  • in the last phase we choose which declarations to keep in the output PML document. To do so we apply a meta-XSLT to some thresholds in order to produce another XSLT document, used to transform the input PML Qualifier document into a PML document.

In this section we have analyzed the five phases that compose the elISA 2.0 process. All these phases are needed to complete the transformation of a web document - well-formed or not - into a PML document on the basis of some rules and thresholds. The two documents that define these rules and thresholds are based on specific Relax NG [Oas01]* + Schematron [Jel05]* grammars. We describe these two grammars in depth in Section 4.1.2.

4.1.2 Rules and thresholds

In Section 4.1.1 we have introduced the five phases of the elISA 2.0 process. As we have seen, in order to complete these phases we need two XML documents in which rules and thresholds are specified. We have seen two examples of these documents in Section 3.3.2 (Code 17 and Code 18). In this section we analyze some aspects of their grammars in order to understand how we can specify rules and thresholds. We introduce all examples using the Relax NG compact syntax [Oas01]*.

The document element of a rules document is “rules”. The structure of this element is simple: we can define some variables using “call”; we can also define some macros (related to three specific conditional elements: “check”, “whenever” and “test”) using the content model expressed for statements; finally, we have to define at least one “rule” element, as we can see in Code 44.

rules = 
    element rules { 
        call*, 
        CM.statement, 
        rule+ 
}

CM.statement = (statementcheck | statementwhenever | statementtest)*
Code 44 - The element “rules”

An element “rule” (Code 45) is characterized by two attributes: “context” is an XPath 2.0 expression that defines which elements the rule relates to, while “deep” is a boolean that specifies whether the analysis must continue for all the children of the context. Besides the optional and repeatable elements “call” and the statement groups, we can give some probabilistic pml declarations using the set elements. All these elements have four attributes - “name”, “ref”, “content” and “weight” - used to specify the declarations.

rule =
    element rule {
        attribute.context,
        attribute.deep?,
        call*,
        CM.statement,
        (
            setContent      |
            setStructure    |
            setPresentation |
            setMetadata     |
            setBehavior
        )*,
        (check | refcheck)*
    }
    
check = 
    element check { 
        (whenever | refwhenever)+, 
        otherwise?
    }
Code 45 - The element “rule”

The last elements that we can use to structure a rule refer to conditional expressions and are called “check”. Within these elements we can specify one or more if statements using “whenever”, with conditions defined by XPath 2.0 expressions. The optional element “otherwise” represents a sort of else for all the previous “whenever”s: if none of their conditions is satisfied, then the “otherwise” block is applied. The content models of the elements “whenever” and “otherwise” are the same, except for the conditional construct of the former (the “test” attribute or the “test” element). Both can contain probabilistic pml declarations - defined by “setContent”, “setStructure” and so on - and other “whenever”/“otherwise” elements.
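To make this concrete, the following is a hypothetical rule using “check”, “whenever” and “otherwise”. The element and attribute names follow the grammar excerpts above, while the namespace, the XPath conditions and the declaration values are illustrative assumptions:

```xml
<rules xmlns="http://www.essepuntato.it/Rules">
    <rule context="div" deep="true">
        <check>
            <whenever test="not(element())">
                <setContent name="Text" content="." weight="0.6" />
            </whenever>
            <otherwise>
                <setStructure name="Pcontainer" content="." weight="0.8" />
            </otherwise>
        </check>
    </rule>
</rules>
```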

The grammar to define thresholds is simpler than the rules grammar. A thresholds document, as we can see in Code 46, has to begin with an element “thresholds” without attributes, which contains one or more elements “threshold”. Each of these elements specifies a context through the attribute “context” - an XPath 2.0 expression - in order to define which elements it refers to. The optional attribute “priority” defines a priority (XSLT-like [Kay07]*) among all the thresholds referring to the same element. The five interleaved elements, named after the Pentaformat [Dii07]* dimensions, allow us to specify threshold values for all the probabilistic pml declarations referring to the context.

thresholds =
    element thresholds { threshold+ }

threshold =
    element threshold {
        attribute.context,
        attribute.priority?,
        (
            content?      & 
            structure?    & 
            presentation? & 
            behavior?     & 
            metadata?
        )
    }
Code 46 - The elements “thresholds” and “threshold”

As we can see in Code 47, the content models of these five elements are similar. Each of them may contain a best-weight threshold or one or more conditional selectors. The first kind of threshold specifies that the declaration with the best weight wins. In case of a tie we select the declaration considering the name priority specified by the attribute “priority” of “bestWeight”.

content = 
    element content { 
        bestWeight.content | select.content+ 
    }

structure =
    element structure { 
        bestWeight.structure | select.structure+ 
    }

presentation =
    element presentation { 
        bestWeight.presentation | select.presentation+ 
    }

behavior = 
    element behaviour { 
        bestWeight.behavior | select.behavior+ 
    }

metadata = 
    element metadata { 
        bestWeight.metadata | select.metadata+ 
    }
Code 47 - The dimensional elements

The content model of any element “select” can contain an element “weight” (that specifies a number and a comparison operator) and an optional element called “name” with a value related to a dimension. Every element “select” of a threshold refers to a dimension and represents a conditional expression; all these expressions are connected using the operator “or”, so as to form a conditional XPath 2.0 expression composed of all the sibling elements “select”. To clarify this we introduce a small example in Code 48. With the threshold referring to all elements “p”, we keep all the probabilistic pml declarations that refer to “p” and that concern the dimension content. In addition, the conditional expression (weight >= 0.7 and name = 'Text') or (weight > 0.9) must be true.

<?xml version="1.0" encoding="UTF-8"?>
<thresholds xmlns="http://www.essepuntato.it/Thresholds">
    <threshold context="p">
        <content>
            <select>
                <weight ge="0.7"/>
                <name value="Text"/>
            </select>
            <select>
                <weight gt="0.9"/>
            </select>
        </content>
    </threshold>
</thresholds>
Code 48 - An example of multiple select for the element “content”
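For comparison, a threshold may use the best-weight selector instead of conditional “select”s. The following sketch is hypothetical - in particular, the format of the “priority” value of “bestWeight” is assumed, since the grammar excerpt above does not show it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<thresholds xmlns="http://www.essepuntato.it/Thresholds">
    <threshold context="div">
        <structure>
            <bestWeight priority="Pcontainer Pblock" />
        </structure>
    </threshold>
</thresholds>
```

Here, among all the probabilistic structure declarations referring to “div”, the one with the highest weight is kept; in case of a tie, the declaration whose name comes first in the priority list would win.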

Obviously, all the probabilistic pml declarations that are not associated with any threshold are not saved into the final PML document.

In this section we have analyzed the two grammars related to rules and thresholds in order to understand how we can define them. The matters explained in Section 3.3, in Section 4.1.1 and in this section conclude the discussion of elISA 2.0. In Section 4.2 we will re-introduce in depth the matters regarding the patterning process.

4.2 Pattern engine: the infrastructure

The engine that we have introduced in Section 3.4 allows us to pattern PML documents (presented in Section 3.2) according to some patterning rules specified in an XML document. Patterning PML documents is fundamental if we want to convert them into IML documents [San06]* in order to use elISA 2.0 in the ISAWiki platform (Section 3.3.1). In fact, as we have illustrated in Section 3.4.3, the main difference between PML and IML is the patterned structure of the latter.

In this section we re-discuss the patterning engine introduced in Section 3.4: we deepen some issues concerning the grammar to define patterning rules (Section 4.2.1) and we present an important configuration file (Section 4.2.2).

4.2.1 How to define a patterning rule-set

The content model of the document element “patterns” has a structure similar to the grammar that defines rules for elISA 2.0 (Section 4.1.2). As we can see in Code 49, it is formed by a sequence of global variables (XSLT-like), by a sequence of conditional macros (elements “statement”) and by one or more patterning rules.

patterns = 
    element patterns { 
        cm.patterns 
    }

cm.patterns = 
    variable*, 
    statement*, 
    pattern+
Code 49 - The element “patterns”

These rules are defined by the element “pattern”. As we can see in Code 50, a rule is characterized by the attribute “match” (defining which elements the rule refers to) and by an optional attribute “priority” that specifies a priority value for it. Besides the two sequences of variables and conditional macros, the element “pattern” must specify an element “choose” in order to apply some patterning operations.

pattern = 
    element pattern { 
        cm.pattern 
    }

cm.pattern =
    attribute.match,
    attribute.priority?, 
    variable*, 
    statement*, 
    choose
Code 50 - The element “pattern”

The content model of “choose” is simple: it contains a sequence of if/else-if blocks defined by some “when” elements or by references to conditional macros. The condition of the elements “when” is expressed by the attribute “test”, containing an XPath 2.0 expression [BBC07a]*. Besides the variables, in this element we can choose whether to use another conditional block or to apply some patterning operations.

choose = 
    element choose { 
        cm.choose 
    }
    
cm.choose = 
    (when | ref)+

when = 
    element when { 
        cm.when 
    }

cm.when =
    attribute.test,
    variable?,
    (
        (unwrap | inject | wrap | rename) | 
        choose
    )
Code 51 - The elements “choose” and “when”

As we have seen in Section 3.4.4, we can use one of the four patterning operations: wrap, unwrap, inject and rename. As we can see in Code 52, all these operations, except rename, support the use of multiple wrap declarations. Any wrap is based on three attributes, two of which are optional. The attribute “pattern” - as in rename - specifies which element we use to wrap the elements selected by the optional attribute “select”. If we do not specify any selection then all elements are considered as the subject of the operation. The optional attribute “force” is useful if we want to specify an order between this new element and the inherited elements to be inserted as specified by previous inject operations.

unwrap = 
    element unwrap { 
        cm.operation 
    }
        
inject = 
    element inject { 
        cm.operation 
    }

cm.operation = 
    subwrap*

rename = 
    element rename { 
        cm.rename 
    }
    
cm.rename = 
    attribute.pattern

wrap = 
    element wraps { 
        cm.wrap 
    }

cm.wrap = 
    subwrap+

subwrap = 
    element wrap { 
        cm.subwrap 
    }

cm.subwrap =
    attribute.pattern, 
    attribute.select?, 
    attribute.force.wrap?, 
    subwrap*
Code 52 - The elements specifying the patterning operations

To understand the semantics of a multiple wrap we present the example in Code 53. In this rule we use three different wraps. Through wrap 1 we wrap, with a container, all the sequences of inline and text nodes. All the elements of these sequences are in turn wrapped by a block through the wrap 2 operation. Wrap 3 is applied to all elements except those selected by the previous “wrap” siblings: in this case (element()|text()) except (element()[f:isInline(.)]|text()).

<?xml version="1.0" encoding="UTF-8"?>
<patterns xmlns="http://www.essepuntato.it/Patterns">
    <pattern match="body">
        <choose>
            <when test="exists(element()|text())">
                <wraps>
                    <!-- wrap 1 -->
                    <wrap pattern="Pcontainer" select="element()[f:isInline(.)]|text()">
                        <!-- wrap 2 -->
                        <wrap pattern="Pblock" />
                    </wrap>
                    <!-- wrap 3 -->
                    <wrap pattern="Pcontainer" />
                </wraps>
            </when>
        </choose>
    </pattern>
</patterns>
Code 53 - An example of multiple wrap

To understand how this patterning rule works we introduce an example. As we can see in Code 54, we have a non-patterned document composed of some nodes (text, inline and block). The goal is to pattern the document so that the “body” document element is patterned by some containers, such as “div”, and each of them is patterned by a block, such as “p”.

<?xml version="1.0" encoding="UTF-8"?>
<body xmlns:pml="http://www.essepuntato.it/PML">
    <pml:dimensions>
        <pml:structure pml:name="Pcontainer" pml:ref="//body" pml:content="." />
        <pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
        <pml:structure pml:name="Pinline" pml:ref="//(em|q|b)" pml:content="." />
    </pml:dimensions>
    This is a little <em>example</em> to understand how we can
    use a multi wrap.
    <p>
        We want to obtain the element <q>body</q> as a sequence
        of container such as <q>div</q>.
    </p>
    With one simple <b>patterning rules</b> we can obtain a patterned
    new document.
</body>
Code 54 - A non-patterned document

Applying the rule introduced in Code 53 to the document in Code 54 produces the patterned document in Code 55. In this new document all the text and inline nodes are patterned through the application of wrap 1 and wrap 2. The only block element of the old document is patterned using the wrap 3 operation instead.

<?xml version="1.0" encoding="UTF-8"?>
<body pmlp:wrap="body1" xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
    <pml:dimensions>
        <pml:structure pml:name="Pcontainer" pml:ref="//body|//div" pml:content="." />
        <pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
        <pml:structure pml:name="Pinline" pml:ref="//(em|q|b)" pml:content="." />
    </pml:dimensions>
    <!-- wrap 1 -->
    <div pmlp:wrapped="body1">
        <!-- wrap 2 -->
        <p pmlp:wrapped="body1">
            This is a little <em>example</em> to understand how we can
            use a multi wrap.
        </p>
    </div>
    <!-- wrap 3 -->
    <div pmlp:wrapped="body1">
        <p>
            We want to obtain the element <q>body</q> as a sequence
            of container such as <q>div</q>.
        </p>
    </div>
    <!-- wrap 1 -->
    <div pmlp:wrapped="body1">
        <!-- wrap 2 -->
        <p pmlp:wrapped="body1">
            With one simple <b>patterning rules</b> we can obtain a patterned
            new document.
        </p>
    </div>
</body>
Code 55 - A patterned version of Code 54 using the patterning rules specified in Code 53

As we have seen in several examples about documents with patterning rules (Code 34 and Code 53), we can use some functions, such as f:isInline or f:isBlock, that allow us to interact with all the pml declarations of the input document. These functions are split into two categories, concerning respectively patterns and Pentaformat dimensions:

  • f:isMarker, f:isAtom, f:isInline, f:isBlock, f:isContainer, f:isTable, f:isRecord. These functions take an element as input and return a boolean value indicating whether the input belongs to the specified pattern;
  • f:isContent, f:isStructure, f:isPresentation, f:isBehaviour, f:isMetadata. These functions take an element as input and return a boolean value indicating whether the input belongs to the specified dimension. We can also use five further functions - f:hasContentName, f:hasStructureName, f:hasPresentationName, f:hasBehaviourName, f:hasMetadataName - that take as input an element and a sequence of names and return a boolean value indicating whether the input element belongs to the specified dimension with at least one name of the input sequence. The value of each name of the input sequence is compared with the corresponding values of the attribute “name” that a pml declaration may have.
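For instance, a hypothetical “when” block of a patterning rule could combine the two categories of functions as follows (the condition and the chosen operation are illustrative):

```xml
<when test="f:isBlock(.) and f:hasStructureName(., ('Paragraph', 'Heading'))">
    <rename pattern="Pblock" />
</when>
```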

In this section we have analyzed the grammar to define patterning rules introducing some examples. In Section 4.2.2 we will illustrate the architecture of the patterning engine in order to introduce some aspects related to the configuration file.

4.2.2 The configuration file

Having understood (Section 4.2.1) how we can specify the patterning rules in order to pattern a PML document and how all the patterning operations work, in this section we illustrate the architecture of the patterning engine, analyzing its main features.

The patterning engine that we have developed is a Java application based on a meta-XSLT document [Kay07]*. As we can see in Picture 17, its goal is to take an input PML document and return a patterned PML+PMLp document. As we have introduced in Section 3.4, the PML+PMLp document returned by the engine is an XML document in which the old non-patterned structure is expressed by PMLp elements and attributes. Besides the input PML document, there are two other documents that the engine uses to perform the transformation. The former is the document with the patterning rules, written in compliance with the grammar introduced in Section 4.2.1.

Picture 17 - The patterning process

The latter, called “definitions.xml”, is used for three reasons:

  • to define a translation of all the values of the attribute “name” of a pml declaration (concerning the structure) into a specific element;
  • to specify a sort of ontology in order to understand which pattern any structure “name” value refers to, as we have illustrated in Section 3.4.5;
  • to describe the content model of all structural patterns.

As we can see in Code 56, an element is associated to every name value of the structure dimension. Every element has an attribute “name” that specifies how the structure must be translated. All elements are arranged in a sort of ontology in which the main classes are represented by the structure “name”s of the seven patterns. All the other structure elements are children of exactly one of them. We use this part of “definitions.xml”, on the one hand, to specify how to translate an element referring to a specific name of the Pentaformat structure; on the other hand, we use it to understand which pattern is associated to a specific element.

<definitions>
    <Pblock name="p">
        <Paragraph name="p" />
        <Heading name="h1" />
    </Pblock>
    
    <Pinline name="span">
        <Generic name="span" />
        <Link name="a" />
        <Strong name="b" />
        <Citation name="q" />
        <Subscript name="sub" />
        <Superscript name="sup" />
        <Emphasis name="i" />
    </Pinline>
    
    <Patom name="span" />
    
    <Pmarker name="span">
        <Image name="img" />
        <Meta name="meta" />
    </Pmarker>
    
    <Pcontainer name="div">
        <Head name="head" />
        <Body name="body" />
        <Divider name="div" />
        <Object name="object" />
        <ListItem name="li" />
        <TableRow name="tr" />
        <TableHeader name="th" />
        <TableCell name="td" />
    </Pcontainer>
    
    <Ptable name="table">
        <Table name="table" />
        <List name="ul" />
    </Ptable>
    
    <Precord name="div">
        <Root name="html" />
    </Precord>
    
    <GeneralStructure name="div" />
</definitions>
Code 56 - Definitions for the “name” attribute of the dimension structure

The second part of “definitions.xml” concerns the content model of every pattern. It is used to understand, during the patterning process, whether an operation may be applied to an element. For example, let us suppose we apply a rename to a “Divider” to transform it into a “Paragraph”, i.e. we want to transform a container into a block. This operation is allowed if and only if the content model of the parent of the “Divider” accepts block elements as children. The engine handles this issue according to the content models specified in “definitions.xml”: if the parent of the “Divider” does not allow a block in its content model, then the engine does not apply the rename.

<contentModels>
    <Pblock>
        <text />
        <comment />
        <processing-instruction />
        <Pinline />
        <Patom />
        <Pmarker />
    </Pblock>
    
    <Pinline>
        <text />
        <comment />
        <processing-instruction />
        <Pinline />
        <Patom />
        <Pmarker type="milestone"  />
    </Pinline>
    
    <Patom>
        <comment />
        <processing-instruction />
        <text />
    </Patom>
    
    <Pmarker />
    
    <Pcontainer>
        <comment />
        <processing-instruction />
        <Ptable />
        <Precord />
        <Pmarker type="meta" />
        <Pblock />
        <Pcontainer />
        <Patom />
    </Pcontainer>
    
    <Ptable>
        <comment />
        <processing-instruction />
        <Ptable />
        <Precord />
        <Pmarker type="meta" />
        <Pblock />
        <Pcontainer />
        <Patom />
    </Ptable>
    
    <Precord>
        <comment />
        <processing-instruction />
        <Ptable />
        <Precord />
        <Pmarker type="meta" />
        <Pblock />
        <Pcontainer />
        <Patom />
    </Precord>
</contentModels>
Code 57 - Content model for all the patterns

In this section we have illustrated how the patterning process works and which documents are important for the engine process. In Section 4.3 we will explain how the two engines - elISA 2.0 (Section 3.3 and Section 4.1) and the patterning engine (Section 3.4 and Section 4.2) - are included in a web application, called elISA Server Side, that we use to segment (and possibly to transform) web documents.

4.3 elISA Server Side

In this section we introduce a web application developed using Java 6.0 [Sun06b]* with Servlet technologies [Sun06a]*. This application, called elISA Server Side, includes all the technologies developed for this work and completes the goal introduced in Chapter 1: to create a rule-based mechanism to segment XML documents according to a five-dimensional model called Pentaformat [Dii07]*, in order to automatically convert them into new documents using one or more of the constituents introduced by the model.

The goal of this application is to analyze a web document, specified by a URL [CCD01]*, in order to return a PML document or another type of document obtained from the output of elISA 2.0. As we can see in Picture 18, the first step of this web application is to apply elISA 2.0 to the URL-specified web document, as we have illustrated in Section 3.3 and Section 4.1.

Picture 18 - elISA Server Side architecture

To use elISA 2.0 for this analysis we choose which rules and thresholds we want. As we can see in the first screenshot of the web application in Picture 19, we can choose them among some XML files provided by elISA Server Side.

After this analysis we can decide what to do with the PML document returned by elISA 2.0. Currently, as we can see in the second screenshot of the application in Picture 20, we can choose among four different options:

  • to get the PML document as is;
  • to show the PML document locally in the browser, using an XSLT transformation that fixes some visualization problems, such as the base path for relative URIs;
  • to color some pml declarations in the PML document, showing the result in the browser using a meta-XSLT transformation that colors all the nodes belonging to some Pentaformat dimensions;
  • to transform the PML document into an IML document, using the patterning engine (Section 3.4 and Section 4.2) to pattern it and a meta-XSLT to transform the PML+PMLp document into an IML document [San06]*.

Picture 19 - elISA 2.0 engine in elISA Server Side

Picture 20 - Four kinds of choices for returning the PML document

The transformation of the PML document into an IML document is performed in two steps. First, we pattern the PML document using the patterning engine. Then, using as input the PML+PMLp document returned by the previous step, we transform it into an IML document through a simple meta-XSLT document. This stylesheet considers only the elements that contain some nodes or that are content nodes, in order to translate the elements according to IML. In Code 58 we can see an extract of the IML document returned by a complete elISA Server Side process applied to a Wikipedia article.

<iml xmlns="http://www.cs.unibo.it/2006/iml" xml:lang="en">
    <body class="mediawiki ns-0 ltr page-Web_3_0">
        <div id="globalWrapper">
            <div id="column-content">
                <div id="content">
                    <h1 class="firstHeading">Web 3.0</h1>
                    <div id="bodyContent">
                        <h3 id="siteSub">From Wikipedia, the free encyclopedia</h3>
                        <p><b>Web 3.0</b> is a term used to describe the future of the <a
                                href="/wiki/World_Wide_Web">World Wide Web</a>. Following the
                            introduction of the phrase "<a href="/wiki/Web_2.0">Web 2.0</a>" as a
                            description of the recent evolution of the Web, many technologists,
                            journalists, and industry leaders have used the term "Web 3.0" to
                            hypothesize about a future wave of Internet innovation.</p>
                        <p>Views on the next stage of the World Wide Web's evolution vary greatly.
                            Some believe that emerging technologies such as the <a
                                href="/wiki/Semantic_Web">Semantic Web</a> will transform the way
                            the Web is used, and lead to new possibilities in <a
                                href="/wiki/Artificial_intelligence">artificial intelligence</a>.
                            Other visionaries suggest that increases in Internet connection speeds,
                            modular <a class="mw-redirect" href="/wiki/Web_applications">web
                                applications</a>, or advances in <a href="/wiki/Computer_graphics"
                                >computer graphics</a> will play the key role in the evolution of
                            the World Wide Web.</p>
                            
                        [...]
                        
                    </div>
                </div>
            </div>
        </div>
    </body>
</iml>
Code 58 - An extract from the transformation of the Wikipedia article “Web 3.0” into an IML document

In this section we have introduced the web application called elISA Server Side. As we have seen, this application performs a complete transformation from a web document to an IML document: first it segments the input document according to the Pentaformat model [Dii07]*, and then it patterns the PML document returned by the first step according to the seven structural patterns [DDD07]*. These transformations - from web documents to PML documents and from PML documents to IML documents - represent the main goal of our work, introduced in Chapter 1.

4.4 Summarizing all the infrastructures

In this chapter we have examined in depth the two engines introduced in Chapter 3: elISA 2.0 and the patterning engine. We have presented their architectures and we have illustrated the Relax NG [Oas01]* grammars that we use to write the XML documents containing rules and thresholds (for elISA 2.0) and patterning rules (for the patterning engine). We have provided some examples showing how to write these documents, and we have illustrated some aspects of an important configuration file related to the patterning engine.

The new tool introduced in this chapter is built on top of our two engines. It is a web application called elISA Server Side that segments web documents and returns PML documents or documents in other formats. Through this web application we fulfil the goal of this thesis: to develop a rule-based mechanism to segment XML documents according to a five-dimensional model called Pentaformat [Dii07]* in order to automatically convert them into new documents exploiting one or more of the constituents introduced by the model. In particular, using the patterning engine, we can convert web documents into IML documents.

In the conclusions of this dissertation (Chapter 5) we will revisit the issues, theories and tools introduced in this chapter and in all the preceding chapters (Chapter 1, Chapter 2, Chapter 3) in order to suggest some future developments for this work.

5 Happily ever after (or Conclusions)

As we have seen in Chapter 1, the claim of this thesis is to develop a rule-based mechanism to segment XML documents according to a five-dimensional model called Pentaformat [Dii07]* in order to automatically convert them into new documents using one or more of the constituents introduced by the model: content, structure, presentation, behaviour and metadata.

We have gone through several phases to complete this conversion. First of all, in Chapter 2 we have discussed data extraction, the main context in which we have worked while developing our tools. In particular we have introduced the concept of content extraction, clarifying what we intuitively mean by the word “content” in the context of web pages (what the authors have written, or what users search for). After this first definition we have explained that not all the elements of a document, such as a web document, belong to the content. For example, in a common web page we can find layout tables, logos and banners that we can consider as presentational items rather than content.

Besides the separation between content and presentation, we have introduced another role for the elements of a web document: metadata. In a common sense, we define metadata as assertions about the document. We can specify metadata for a web page using several techniques, from the (X)HTML tag “meta” to semantic assertions expressed through Semantic Web [BHL01]* technologies such as OWL [BDH04]*, RDF [BM04]*, RDFa [AB07]* and microformats [All07]*.

After this brief introduction to the roles that the elements of a web document can play, we have illustrated some tools and theories related to content extraction, showing how they work. The main goal of these tools is to identify whether or not the elements of a web page are content, leaving out the recognition of the roles of the non-content elements. In our opinion, this is the main shortcoming of these works: we think that identifying the roles of the remaining non-content elements is just as important as extracting the content itself.

To address this shortcoming, we have developed a rule-based engine, called elISA 2.0 (Extraction of Layout Information via Structural Analysis 2.0), to segment XML documents according to the Pentaformat model. As we have illustrated in Chapter 3, this engine is based on a language, called PML (Pentaformat Markup Language), that allows us to identify the roles of web page elements through simple declarations. The specific context in which we want to use this engine is the ISAWiki platform [DV04]*, a client/server application that lets registered users edit any web page and store it on an appropriate server. In order to identify which parts of a web document users can modify, this platform currently uses an old version of elISA [DVV04]* that segments web documents according to two main dimensions: content and (a small set of) presentation. Our goal is to replace this old version of the engine with the new elISA 2.0 in order to segment XML documents according to the Pentaformat model.
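As a purely illustrative sketch, a segmented document might associate one or more Pentaformat dimensions with each node. Note that the element and attribute names below are invented for this example and do not reproduce the actual PML syntax defined in Chapter 3:

```xml
<!-- Hypothetical illustration only: the "role" attribute is invented
     here and is not the actual PML declaration syntax (see Chapter 3). -->
<div role="structure">
  <img src="logo.png" role="presentation"/>
  <h1 role="content structure">Web 3.0</h1>
  <p role="content">Web 3.0 is a term used to describe ...</p>
  <span role="metadata">Last modified: 03/03/2008</span>
</div>
```

The point of such declarations is that a later transformation can keep, say, only the nodes carrying the content dimension and discard purely presentational ones.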

Unfortunately PML - the output of elISA 2.0 - is not the format in which ISAWiki stores documents. All documents in this platform are stored in an intermediary language called ISAWiki Markup Language (IML) [San06]*. IML is based on a specific structural pattern model [DDD07]*: it structures the content of a document according to seven structural patterns: marker, atom, inline, block, container, table and record. The problem is that PML does not comply with the structural pattern model used in IML. To allow the transformation of a PML document into an IML document, we needed to develop another rule-based engine. This tool, called the patterning engine, patterns XML documents according to a specified set of patterning rules. Its output is a patterned XML document that can easily be transformed, thanks to a meta-XSLT document [Kay07]*, into an IML document.
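To give an intuition of what pattern classification involves, here is a minimal, hypothetical sketch in Python. It assigns only four of the seven patterns based on a node's text and element children; the real patterning engine works on a far richer XML rule set and distinguishes all seven patterns:

```python
# Hedged sketch: a toy rule-based classifier inspired by the structural
# pattern model of [DDD07]. It covers only marker, atom, inline and
# container; the actual patterning engine is rule-driven and handles
# all seven patterns (including block, table and record).
import xml.etree.ElementTree as ET

def classify(el):
    """Assign a structural pattern to an XML element."""
    # An element "has text" if it carries text directly or between children.
    has_text = bool((el.text or "").strip()) or any(
        (c.tail or "").strip() for c in el)
    has_children = len(el) > 0
    if not has_text and not has_children:
        return "marker"      # empty element, e.g. <hr/>
    if has_text and not has_children:
        return "atom"        # text only, no element children
    if has_text and has_children:
        return "inline"      # mixed content
    return "container"       # element children only

doc = ET.fromstring(
    "<div><p>Some <em>mixed</em> text</p><hr/><span>plain</span></div>")
print(classify(doc))         # container: element children only
print(classify(doc[0]))      # inline: mixed content
print(classify(doc[1]))      # marker: empty element
print(classify(doc[2]))      # atom: text only
```

The interesting part of the real problem is exactly what this sketch omits: deciding, via rules, how to rearrange nodes whose actual structure violates the pattern their context requires.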

We have analyzed this conversion process in depth in Chapter 4, where we have revisited the issues concerning the segmentation and the patterning of XML documents in order to explain the infrastructure of the two engines that we have developed. After this explanation, we have introduced a web application that performs this conversion: elISA Server Side. This is a web application - developed in Java 6.0 [Sun06b]* using the Servlet technology [Sun06a]* - that embeds our engines to transform web documents into IML documents. This application realizes the claim proposed in Chapter 1.

A possible future work related to this thesis is to extend the patterning engine in order to allow semantic patterning. In our context, this type of patterning must:

  • pattern the XML document obtaining a new XML document (syntax constraint);
  • specify in the patterned document the non-patterned structure of the old document (structure constraint);
  • allow the same visualization for both the patterned and the non-patterned documents (semantic constraint).

While the first two points are enough to allow syntactic patterning, the third is fundamental to guarantee semantic patterning. In most scenarios a syntactic patterning of a document yields a semantic patterning too. However, there are some (rather unusual) scenarios, such as the document in Picture 21, in which the syntactic patterning performed by our patterning engine changes the visualization between the input and the output document.

Picture 21 - A non-patterned XML document

As we have illustrated, a constraint for the PML+PMLp document is to specify the non-patterned structure of the old document. In order to return a patterned XML document that complies with this constraint, the patterning engine adds some elements to the original document, as we can see in Picture 22. Unfortunately the patterning specified in this document is not a semantic patterning, because the visualization of the old document (Picture 21) and of the new one (Picture 22) is not the same.

A way to specify the old non-patterned structure while patterning the document semantically is to use an overlapping-markup technique called milestones [AMP03]*. As we can see in Picture 23, by expressing the overlapping element through this technique we can obtain a correct semantic patterning for the document in Picture 21 while preserving the old non-patterned structure as well.
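To make the idea concrete, here is a hedged sketch of the milestone technique (the element names are invented for illustration). An emphasis that would overlap two blocks of the patterned hierarchy cannot be kept as a single element without breaking well-formedness, so it is split into a pair of empty marker elements linked by an identifier:

```xml
<!-- Hypothetical sketch: the overlapping emphasis is replaced by empty
     start/end milestones, so the patterned hierarchy stays well-formed
     while the original span remains recoverable from the id/ref pair. -->
<block>first part <emStart id="e1"/>of the emphasised</block>
<block>text continues here<emEnd ref="e1"/> and then ends.</block>
```

Since the milestones are empty, they do not alter the block structure that the pattern model requires, yet a processor can reconstruct the original emphasis by pairing `emStart` and `emEnd`, which is what makes the patterning semantic rather than merely syntactic.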

Picture 22 - A syntactic patterning for the document in Picture 21

Picture 23 - A semantic patterning for the document in Picture 21

Our goal is to extend the current version of the patterning engine in order to make semantic patterning possible for any XML document. This goal represents the main future work of this thesis.

Bibliography

[AB07] B. Adida, M. Birbeck - RDFa Primer: Embedding Structured Data in Web Pages - W3C Working Draft - 26 October 2007

[AHR01] H. Alam, R. Hartono, A. F. R. Rahman - Content Extraction from HTML Documents - International workshop on the Web Document Analysis (WDA 2001), Seattle, WA, USA - 8 September 2001

[AKK06] W. Abramowicz, T. Kaczmarek, M. Kowalkiewicz, M. E. Orlowska - Robust Web Content Extraction - 15th International World Wide Web Conference (WWW 2006), Edinburgh, Scotland, UK - 23-26 May 2006

[AMP03] L. Arévalo, J. C. Manzano, A. Polo, M. Salas - Multiple Markups in XML Documents - Springer Press, Proceedings of the International Conference on Web Engineering (ICWE '03), pp. 222-225, Oviedo, Spain - 14-18 July 2003

[Ale79] C. Alexander - The Timeless Way of Building - Oxford University Press - 1979

[All07] J. Allsopp - Microformats: Empowering Your Markup for Web 2.0 - Friends of ED Press - March 26 2007

[BBC07a] A. Berglund, S. Boag, D. Chamberlin, M. F. Fernández, M. Kay, J. Robie, J. Siméon - XML Path Language (XPath) 2.0 - W3C Recommendation - 23 January 2007

[BBC07b] M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland - Open Information Extraction from the Web - Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2670-2676, Hyderabad, India - 6-12 January 2007

[BCC02] M. Branstein, R. Coleman, W. B. Croft, M. King, W. Li, D. Pinto, X. Wei - Quasm: a system for question answering using semi structured data - ACM, Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries (JCDL '02), pp. 46-55, Portland, OR, USA - 14-18 July 2002

[BCH07] B. Bos, T. Çelik, I. Hickson, H. Wium Lie - Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) - W3C Candidate Recommendation - 19 July 2007

[BCL04] S. Byrne, M. Champion, P. Le Hégaret, A. Le Hors, G. Nicol, J. Robie, L. Wood - Document Object Model (DOM) Level 3 Core Specification Version 1.0 - W3C Recommendation - 07 April 2004

[BDH04] S. Bechhofer, M. Dean, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, G. Schreiber, L. A. Stein - OWL Web Ontology Language Reference - W3C Recommendation - 10 February 2004

[BHL01] T. Berners-Lee, J. Hendler, O. Lassila - The Semantic Web - Scientific American Magazine - 2001

[BHL06] T. Bray, D. Hollander, A. Layman, R. Tobin - Namespaces in XML 1.0 (Second Edition) - W3C Recommendation - 16 August 2006

[BM04] D. Beckett, B. McBride - RDF/XML Syntax Specification (Revised) - W3C Recommendation - 10 February 2004

[BP00] S. Brin, L. Page - The Anatomy of a Large-Scale Hypertextual Web Search Engine - Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA, USA - 2000

[Bag04] M. Bagnasco - Progettazione e implementazione di funzionalità di analisi di strutture di una pagina all'interno di un editor HTML client-side - Undergraduate thesis, Computer Science Department, University of Bologna, Bologna, Italy - 2004

[CCD01] T. Coates, D. Connolly, D. Dack, L. Daigle, R. Denenberg, M. Dürst, P. Grosso, S. Hawke, R. Iannella, G. Klyne, L. Masinter, M. Mealling, M. Needleman, N. Walsh - URIs, URLs, and URNs: Clarifications and Recommendations 1.0 - Report from the joint W3C/IETF URI Planning Interest Group, W3C Note 21 September - 21 September 2001

[CD99] J. Clark, S. DeRose - XML Path Language (XPath) 1.0 - W3C Recommendation - 16 November 1999

[CGG04] M. F. Chiang, P. Grimm, S. Gupta, G. E. Kaiser, J. Starren - Automating Content Extraction of HTML Documents - Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal - 26-28 May 2004

[CJV99] W. Chisholm, I. Jacobs, G. Vanderheiden - Web Content Accessibility Guidelines 1.0 - W3C Recommendation - 5 May 1999

[CMO05] S. Cassidy, C. Mantratzis, M. Orgun - Separating xhtml content from navigation clutter using dom-structure block analysis - ACM, Sixteenth ACM Conference on Hypertext and Hypermedia (HYPERTEXT '05), pp. 145-147, Salzburg, Austria - 6-9 September 2005

[CNS07] W. Choochaiwattana, W. Niranatlamphong, M. B. Spring - Web image classification algorithm: a heuristic rule-based approach - Vic Grout, editors, Proceedings of the Second International Conference on Internet Technologies and Applications (ITA '07), pp. 201-207, Wrexham, North Wales, UK - 4-7 September 2007

[Car65] L. Carroll - Alice's Adventures in Wonderland and Through the Looking-Glass - Published by the Penguin Group in 1998 (centenary edition) - 1865

[Cla99] J. Clark - XSL Transformations (XSLT) - W3C Recommendation - 16 November 1999

[DDD07] A. Dattolo, A. Di Iorio, S. Duca, A. A. Feliziani, F. Vitali - Patterns for descriptive documents: a formal analysis - Technical Report UBLCS-2007-13, Computer Science Department, University of Bologna, Bologna, Italy - 2007

[DGK02] M. Diligenti, M. Gori, M. Kovacevic, V. Milutinovic - Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification - IEEE Computer Society, Proceeding of IEEE International Conference on Data Mining (ICDM 2002), pp. 250-257, Maebashi City, Japan - 9-12 December 2002

[DV03] A. Di Iorio, F. Vitali - A Xanalogical Collaborative Editing Environment - Proceedings of the Second International Workshop of Web Document Analysis (WDA 2003), Edinburgh, Scotland, UK - 3 August 2003

[DV04] A. Di Iorio, F. Vitali - Writing the web - Journal of Digital Information '04 - 5 May 2004

[DV05] A. Di Iorio, F. Vitali - From the Writable Web to the Global Editability - ACM, Proceedings of the sixteenth ACM conference on Hypertext and hypermedia (HYPERTEXT '05), pp. 35–45, New York, NY, USA - 2005

[DVV04] A. Di Iorio, E. Ventura Campori, F. Vitali - Rule-based structural analysis of web pages - In Simone Marinai and Andreas Dengel, editors, Document Analysis VI, volume 3163 of Lecture Notes in Computer Science, pages 425-437, Springer Verlag - 2004

[Dii07] A. Di Iorio - Pattern-based Segmentation of Digital Documents: Model and Implementation - Ph.D. thesis, Computer Science Department, University of Bologna, Bologna, Italy - 2007

[EFH02] A. K. Elmagarmid, J. Fan, M. Hacid, X. Zhu - Model-Based Video Classification toward Hierarchical Representation, Indexing and Access - ACM, Multimedia Tools and Applications, Volume 17 (Issue 1), pp. 97–120 - 2002

[FKS01] A. Finn, N. Kushmerick, B. Smyth - Fact or fiction: content classification for digital libraries - Proceedings of the Second DELOS Network of Excellence Workshop on Personalisation and Recommender Systems in Digital Libraries, Dublin, Ireland - 18-20 June 2001

[Flo05] L. Floridi - Semantic Conceptions of Information - Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy (Winter 2005 Edition), Available online - 2005

[GHJ95] E. Gamma, R. Helm, R. Johnson, J. Vlissides - Design Patterns: Elements of Reusable Object-Oriented Software - Addison-Wesley Professional, New York, NY, USA - 1995

[GM02] R. J. Glushko, T. McGrath - Document Engineering for e-Business - ACM, Proceedings of the ACM symposium on Document Engineering, pp. 42-48, McLean, VA, USA - 8-9 November 2002

[Gar05] J. J. Garrett - Ajax: A New Approach to Web Applications - Web article - 18 February 2005

[Got07] T. Gottron - Evaluating content extraction on HTML documents - Vic Grout, editors, Proceedings of the Second International Conference on Internet Technologies and Applications (ITA '07), pp. 123-128, Wrexham, North Wales, UK - 4-7 September 2007

[Gru92] T. R. Gruber - A Translation Approach to Portable Ontology Specifications - Knowledge Systems Laboratory, Technical Report KSL 92-71, Computer Science Department, Stanford University, Stanford, California, USA - 1992

[Gub04] D. Gubellini - Linguaggi di schema per XML e modelli astratti di documenti - Master thesis, Computer Science Department, University of Bologna, Bologna, Italy - 2004

[JLR99] I. Jacobs, A. Le Hors, D. Raggett - HTML 4.01 Specification - W3C Recommendation - 24 December 1999

[Jel05] R. Jelliffe - Schematron Specification - Final Committee Draft - 2005

[KT06] K. Koutroumbas, S. Theodoridis - Pattern Recognition - Academic Press (3rd edition) - 2006

[Kay07] M. Kay - XSL Transformations (XSLT) Version 2.0 - W3C Recommendation - 23 January 2007

[LLY03] X. Li, B. Liu, L. Yi - Eliminating Noisy Information in Web Pages for Data Mining - ACM, The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA - 24-27 August 2003

[MRS02] J. Mason, M. Roach, F. Stentiford, L. Xu - Recent trends in video analysis: a taxonomy of video classification problems - Proceedings of the International Conference on Internet and Multimedia Systems and Applications (IASTED), St. Thomas, Virgin Islands, USA - 18-20 November 2002

[Nel80] T. Nelson - Literary Machines: The report on, and of, Project Xanadu concerning word processing, electronic publishing, hypertext, thinkertoys, tomorrow's intellectual... including knowledge, education and freedom - Mindful Press, Sausalito, CA, USA - 1980

[Nis04] NISO - Understanding Metadata - NISO Press, available on the web - 2004

[Oas01] OASIS - RELAX NG Specification - Committee Specification - 2001

[Rij79] C. J. van Rijsbergen - Information Retrieval - Second edition, available on web - 1979

[San06] G. Sanchietti - L'uso di design pattern nella conversione fra formati testuali: una proposta e un progetto - Undergraduate thesis, Computer Science Department, University of Bologna, Bologna, Italy - 2006

[Sun02] Sun - Reflection Guide - Guide - 2002

[Sun03] Sun - JAR File Specification - Specification - 2003

[Sun06a] Sun - Servlet 2.5 Specification - Specification - 2006

[Sun06b] Sun - Java™ Platform, Standard Edition 6 - API Specification - 2006

[Ven03] E. Ventura Campori - Estrazione di informazioni di layout attraverso analisi strutturale nelle pagine HTML - Master thesis, Computer Science Department, University of Bologna, Bologna, Italy - 2003
