The SHARC framework for data quality in Web archiving
The paper “The SHARC framework for data quality in Web archiving”, co-written by D. Denev, A. Mazeika, M. Spaniol and G. Weikum, has been accepted for publication in the VLDB Journal 2011 (impact factor: 4.517 (2009)).
The paper is available for download via Online First in the VLDB Journal.
Archiving Data Objects using Web Feeds
The paper entitled “Archiving Data Objects using Web Feeds” by M. Oita and P. Senellart has been accepted for presentation at IWAW 2010.
Web feeds, in either the RSS or the Atom XML-based format, are evolving descriptive documents that characterize a dynamic hub of a Web site and help subscribers keep up with the most recent Web content of interest. This paper shows how Web feeds can be useful instruments for information extraction and Web page change detection. Web pages referenced by feed items are usually blog posts or news articles, that is, data of a dynamic (and hence ephemeral) nature that is clustered topically in a feed channel.
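As a rough illustration of feed-based change detection (a minimal sketch only, not the method of the paper), the following Python snippet compares two snapshots of an RSS 2.0 feed and reports the links of items that appear only in the newer one. The function names `feed_items` and `new_links`, the sample feeds, and the choice of `<guid>` (falling back to `<link>`) as the item identifier are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def feed_items(rss_text):
    """Return {item id: link} for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = {}
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        # Assumption: <guid> identifies an item; fall back to <link> if absent.
        guid = (item.findtext("guid") or link).strip()
        if guid:
            items[guid] = link
    return items


def new_links(old_rss, new_rss):
    """Links of items present in the newer snapshot but not in the older one."""
    old = feed_items(old_rss)
    new = feed_items(new_rss)
    return [link for guid, link in new.items() if guid not in old]


# Hypothetical feed snapshots taken at two different crawl times.
OLD_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><guid>a1</guid><link>http://example.org/post-1</link></item>
</channel></rss>"""

NEW_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><guid>a2</guid><link>http://example.org/post-2</link></item>
<item><guid>a1</guid><link>http://example.org/post-1</link></item>
</channel></rss>"""

if __name__ == "__main__":
    for link in new_links(OLD_FEED, NEW_FEED):
        print("new page to archive:", link)
```

In such a setup, only the pages behind newly appearing feed items would need to be fetched, which is one way a feed can serve as a cheap change signal for a crawler.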
IWAW 2009
IWAW09 took place on 30 September and 1 October 2009, in conjunction with ECDL in Corfu (Greece). The proceedings are now available online.
Around 40 participants attended IWAW2009, which took place on 30 September and 1 October 2009, in conjunction with ECDL in Corfu (Greece). The workshop provided a comprehensive overview of active research and practice in the preservation of the Web. This year’s workshop also addressed several new approaches and research directions (from the preservation of virtual worlds to the temporal dimension of Web archives), as well as practical issues faced by archiving institutions, specifically the storage of large volumes of digital material. In this context, a special session was devoted to the WARC storage format, which has been accepted as a new ISO standard (ISO 28500:2009), and to emerging tool support for handling these container objects. More generally, scalability and the management of large-volume crawls were discussed intensively, drawing on the growing body of experience at the numerous institutions that now run Web archiving activities in a range of different configurations.
LiWA Architecture
Presented at IWAW 08 by Radu Pop, Wolf Siberski, Mark Williamson
See presentation here
The Challenge of Dynamic Links
Presented at IWAW 08 by Mark Williamson
See presentation here
Web Spam: a Survey with Vision for the Archivist
Presented at IWAW 08 by Andras Benczur, David Siklosi, Jacint Szabo, Istvan Biro, Zsolt Fekete, Miklos Kurucz, Attila Pereszlenyi, Simon Racz, Adrienn Szabo
Web archive quality is endangered by Web spam, a side effect of the high commercial value of top-ranked search-engine results, yet Web spam filtering technologies are so far rarely used by Web archivists. In this paper we make the first attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We survey the state of the art in Web spam filtering, illustrated by the recent Web spam challenge data sets and techniques, and describe the filtering solution for archives envisioned in the LiWA project.
See paper here
