MINOTAVROS: A tool for the semi-automated creation of large corpora from the Web

Ilias Koutsis, George Kouklakis, Georgios Mikros, George Markopoulos

Research output: Contribution to conference, Paper, peer-review

Abstract

Over the last decade, the rapid growth of the Web has made a vast amount of electronic text readily available to the interested linguist. The Web can now be used as a gigantic text collection in which anyone can find thousands of texts in most of the world's languages. Although, in terms of strict representativeness, the Web cannot offer every text type (Ide et al. 2002), it remains the most efficient way to create corpora easily. As Kilgarriff & Grefenstette (2003: 343) point out, “the Web is not representative of anything else. But neither are other corpora, in any well-understood sense”.

Internet use in Greece has increased rapidly in recent years, with 71.8% user growth in the period from 2000 to 2004. Modern Greek has a notable presence on the Web, already counting 2 million web pages and growing (approx. 0.1% of the total web pages harvested by Google). On the other hand, few corpora of Modern Greek are readily available, and many linguists who wish to refer to corpus evidence have to build their own small corpora from the Greek Web.
Original language: English
Publication status: Published - 2005
Externally published: Yes
Event: Corpus Linguistics Conference 2005 - Birmingham, United Kingdom
Duration: 14 Jul 2005 - 17 Jul 2005

Conference

Conference: Corpus Linguistics Conference 2005
Country/Territory: United Kingdom
City: Birmingham
Period: 14/07/05 - 17/07/05
