2010
In this work we define a hybrid Web Content Mining strategy aimed at recognizing, within Web pages, the main entity, intended as the short text that refers directly to the main topic of a given page. The salient aspect of the strategy is the use of a novel supervised Machine Learning model able to represent, in a unified framework, the integrated use of visual page layout features, textual features and hyperlink descriptions. The proposed approach has been evaluated with promising results.
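The abstract does not detail the model itself; as a rough, hypothetical sketch of how visual layout, textual, and hyperlink features could be merged into one supervised learner, a feature-vector approach might look like the following. The feature names, the block representation, and the logistic-regression choice are assumptions for illustration, not the authors' actual design.

```python
# Minimal sketch: classify candidate text blocks as "main entity" or not by
# combining visual-layout, textual, and hyperlink features in one vector.
# All feature names and the classifier choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def block_features(block):
    """block: dict with hypothetical keys describing one rendered text block."""
    return [
        block["x"] / block["page_width"],      # visual: horizontal position
        block["y"] / block["page_height"],     # visual: vertical position
        block["font_size"],                    # visual: rendered font size
        len(block["text"].split()),            # textual: token count
        block["text"].istitle(),               # textual: title-like casing
        block["anchor_overlap"],               # hyperlink: overlap with anchor texts pointing to the page
    ]

def train(blocks, labels):
    X = np.array([block_features(b) for b in blocks], dtype=float)
    return LogisticRegression(max_iter=1000).fit(X, labels)

def main_entity(model, blocks):
    """Return the block with the highest predicted probability of being the main entity."""
    X = np.array([block_features(b) for b in blocks], dtype=float)
    probs = model.predict_proba(X)[:, 1]
    return blocks[int(probs.argmax())]
```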
Journal of Intelligent Information Systems, 2000
Research in Web mining is moving the World Wide Web toward a more useful environment in which users can quickly and easily find the information they need. Web mining refers to the discovery and analysis of data, documents, and multimedia from the World Wide Web. It includes hyperlink structure, statistical usage, and document content mining. Structure mining is concerned with the discovery of information through the analysis of a Web page's in and out links. This kind of information can establish the authority of a Web page and help in page categorization. Usage mining applies data mining techniques to discover patterns in Web logs. This is useful in defining collaboration between users and refining users' personal preferences. Content mining extracts concepts from the content of Web pages. Information retrieval techniques are applied to unstructured (text), semi-structured (HTML, XML), and structured (databases) Web pages to extract semantic meaning. This journal issue presents current research in Web content mining of unstructured and semi-structured Web pages. Search engines have the responsibility for extracting semantic meaning from the content of Web pages. So much information is now available that a searcher must depend upon search engines for possible information sources. With Web content as diverse as the authors creating Web pages, the search engine must understand the content of the individual Web pages for a searcher to effectively find information. This is not a trivial task. Authors of unstructured and semi-structured text may not be concerned with the automatic extraction of meaning. Typically, text is written for a human audience, which is naturally capable of extracting meaning. Extracting semantic meaning requires an understanding of the elements of the Web page and of the relationships between those elements. The extracted meaning must then be placed in a structure that is easily searchable in response to a query. Basically, search engines consist of three parts: the user interface, the spider, and the index. The user interface is where the searcher enters keywords as a search query. These keywords represent the searcher's information need. Prior to the query, the spider had found pages on the Web. These pages are indexed as keywords, locations, and other descriptive information. The keywords selected represent the concepts expressed in the page. The search is simply a query of the index. The keywords in the query are matched to the keywords in the index, and the more matching keywords, the better the page is as an information source. The locations (and other information in the index) of the pages with the best matches are returned to the searcher.
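To make the spider/index/query pipeline described above concrete, the sketch below builds a toy inverted index and ranks pages by the number of matching query keywords. It is a deliberate simplification of what the abstract describes, not an implementation of any particular engine.

```python
# Toy inverted index: pages are indexed by keyword, and a query is answered by
# counting how many query keywords each page matches (more matches = better).
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text (what the spider would have fetched)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for keyword in set(text.lower().split()):
            index[keyword].add(url)
    return index

def search(index, query):
    """Rank URLs by the number of query keywords they contain."""
    scores = defaultdict(int)
    for keyword in query.lower().split():
        for url in index.get(keyword, ()):
            scores[url] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

pages = {
    "http://example.org/a": "web mining extracts concepts from page content",
    "http://example.org/b": "usage mining applies data mining to web logs",
}
index = build_index(pages)
print(search(index, "web content mining"))
```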
2011
The web is recognized as the largest data source in the world. The nature of such data is characterized by partial or no structure, and, even worse, there exists no standard data schema even for the small amount of structured data. Web Mining aims to extract useful knowledge from the Web by using a variety of techniques that have to cope with the heterogeneity and the lack of a unique, fixed way of representing information. An important role in Web Mining is played by the automation of extraction rules through proper algorithms. Machine Learning techniques have been successfully applied to Web Mining and Information Extraction tasks thanks to the generalization and adaptation capabilities that are a key requirement on general-content, heterogeneous web pages. The World Wide Web is a graph, more precisely a directed labeled graph where the nodes are represented by the pages and the edges are represented by the links between them. Recent works propose the exploitation of the web structure (Link Analysis) for content extraction; for example, one can leverage the content category of neighbor pages to categorize the contents of difficult web pages where word-frequency-based techniques are not robust enough. In this thesis we propose an automated method, suitable for a wide range of domains, based on Machine Learning and Link Analysis. In particular, we propose an inductive model able to recognize content pages where structured information is located after being trained with proper input data. In order to keep the recognition speed high enough for real-world applications, an additional algorithm is proposed which lets the approach improve both in speed and quality. The proposed method has been tested on a controlled dataset in a classic train-and-test scenario and in a real-world web crawling system.

This work has been carried out with the research group I am part of, namely Prof. Elisabetta Binaghi and Ignazio Gallo. Earlier works on closely related areas, such as Document Clustering [Frakes and Baeza-Yates, 1992], Named Entity Recognition [Nadeau and Sekine, 2007] and Text Categorization [Sebastiani, 2002], have contributed to the development of the proposed approach. Huge thanks to Prof. Fabio Crestani for his contribution to the development of the thesis. This research project has been funded and supported by 7Pixel, a company that owns and runs leading price comparison services in Europe (http://www.trovaprezzi.it and http://www.shoppydoo.com are the two major brands). The development of the real-world web crawling system would not have been possible without the continued effort of the 7Pixel R&D Team, in particular Alessandro and Roberto. Thank you.
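The abstract mentions falling back on the categories of neighbouring pages when word-frequency features alone are not robust. A minimal sketch of that idea follows; the keyword lists, the confidence margin, and the majority-vote rule are assumptions for illustration, not the thesis' actual model.

```python
# Sketch of link-analysis-assisted page categorization: when the content-based
# score is not confident, fall back on the majority category of the pages that
# link to (or are linked from) the target page. Purely illustrative.
from collections import Counter

def content_score(text, category_keywords):
    """Crude word-frequency score of `text` for each category."""
    words = text.lower().split()
    return {cat: sum(words.count(k) for k in kws) for cat, kws in category_keywords.items()}

def categorize(text, neighbor_categories, category_keywords, min_margin=2):
    scores = content_score(text, category_keywords)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # If the best category clearly beats the runner-up, trust the content alone.
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= min_margin:
        return ranked[0][0]
    # Otherwise lean on the link structure: majority vote of the neighbours.
    if neighbor_categories:
        return Counter(neighbor_categories).most_common(1)[0][0]
    return ranked[0][0]

cats = {"news": ["article", "reporter"], "product": ["price", "cart", "buy"]}
print(categorize("add to cart for the best price", ["product", "product", "news"], cats))
```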
1999
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organize documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult, due to the sheer amount of material on the Web; it is thus becoming necessary to resort to techniques for the automatic classification of documents. Automatic classification is traditionally performed by extracting the information for representing a document ("indexing") from the document itself. The paper describes the novel technique of categorization by context, which instead extracts useful information for classifying a document from the context in which a URL referring to it appears. We present the results of experimenting with Theseus, a classifier that exploits this technique.
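As an illustration of categorization by context, the snippet below gathers the anchor text plus a small window of preceding words for every link pointing at a target URL; those snippets, rather than the document itself, would then be fed to a text classifier. This is only a sketch of the idea under simplifying assumptions, not the Theseus implementation.

```python
# Sketch of "categorization by context": describe a target document using the
# text surrounding the URLs that refer to it on other pages (anchor text plus
# a few preceding words), instead of indexing the document itself.
import re

def context_snippets(referring_html, target_url, window=10):
    """Return anchor text plus a window of preceding words for each link to target_url."""
    snippets = []
    link_re = re.compile(
        r'<a[^>]+href="%s"[^>]*>(.*?)</a>' % re.escape(target_url),
        re.IGNORECASE | re.DOTALL,
    )
    for match in link_re.finditer(referring_html):
        anchor = re.sub(r"<[^>]+>", " ", match.group(1)).strip()
        # Words appearing just before the link in the source, as extra context.
        before = re.sub(r"<[^>]+>", " ", referring_html[: match.start()]).split()[-window:]
        snippets.append(" ".join(before + [anchor]))
    return snippets

html = 'Useful <b>surveys</b> of web mining: <a href="http://example.org/survey">a comprehensive survey</a>.'
print(context_snippets(html, "http://example.org/survey"))
```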
Advances in Artificial Intelligence, 2011
Virtual integration systems require a crawler to navigate through web sites automatically, looking for relevant information. This process is online, so while the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory to improve the crawler's efficiency. Most crawlers need to download a page to determine its relevance, which results in a high number of irrelevant pages being downloaded. In this paper, we propose a classifier that helps crawlers to navigate efficiently through web sites. This classifier is able to determine whether a web page is relevant by analysing exclusively its URL, minimising the number of irrelevant pages downloaded, improving crawling efficiency and reducing the bandwidth used, which makes it suitable for virtual integration systems.
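A minimal sketch of a URL-only relevance classifier is shown below: the page is judged from the tokens of its URL alone, so it never needs to be downloaded first. The training URLs, labels, tokenization, and choice of classifier are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of a URL-only relevance classifier for crawling: split each URL into
# tokens (path segments, query keys, words in slugs) and learn which token
# patterns indicate relevant pages. Illustrative data and model choices.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def url_tokens(url):
    # Split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

train_urls = [
    "http://shop.example.com/product/laptop-15-inch?id=42",  # relevant (detail page)
    "http://shop.example.com/product/phone-case",             # relevant
    "http://shop.example.com/help/shipping-policy",           # irrelevant
    "http://shop.example.com/login",                          # irrelevant
]
labels = [1, 1, 0, 0]

model = make_pipeline(CountVectorizer(analyzer=url_tokens), LogisticRegression())
model.fit(train_urls, labels)
print(model.predict(["http://shop.example.com/product/tablet-10-inch"]))
```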
Transactions of The Japanese Society for Artificial Intelligence, 2010
Directory services are popular among people who search for their favorite information on the Web. Those services provide hierarchical categories for finding a user's favorite page. Pages on the Web are categorized into one of the categories by hand. Many existing studies classify a web page by using the text in the page. Recently, some studies use text not only from the target page which they want to categorize, but also from the original pages which link to the target page. The text in the original pages must be narrowed down, because those pages include many text parts that are not related to the target page. However, these studies always use a single extraction method for all pages: although web pages differ greatly in their formats, the extraction method is never adapted to them. We have already developed an extraction method for anchor-related text. We use the text parts extracted by our method for classifying web pages. The results of the experiments showed that our extraction method...
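As a sketch of extracting anchor-related text that adapts to the referring page's layout, one could take the text of the block element enclosing each link rather than a fixed-size window. This assumes BeautifulSoup is available and is not the authors' actual extraction method.

```python
# Sketch of anchor-related text extraction: for each link pointing at the
# target page, take the text of the enclosing block element (paragraph, list
# item, cell, ...) so the extracted part adapts to the referring page's layout.
from bs4 import BeautifulSoup

def anchor_related_text(referring_html, target_url):
    soup = BeautifulSoup(referring_html, "html.parser")
    snippets = []
    for a in soup.find_all("a", href=target_url):
        container = a.find_parent(["p", "li", "td", "div"]) or a
        snippets.append(container.get_text(" ", strip=True))
    return snippets

html = ("<ul><li>A useful <a href='http://example.org/t'>survey of web mining</a> "
        "with many references.</li><li>Unrelated item.</li></ul>")
print(anchor_related_text(html, "http://example.org/t"))
```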
Knowledge-Based Systems, 2014
Unsupervised URL-Based Web Page Classification refers to the problem of clustering the URLs in a web site so that each cluster includes a set of pages that can be classified using a unique class. The existing proposals to perform URL-based classification suffer from a number of drawbacks: they are supervised, which requires the user to provide labelled training data and makes them difficult to scale; they are language or domain dependent, since they require the user to provide dictionaries of words; or they require extensive crawling, which is time and resource consuming. In this article, we propose a new statistical technique to mine URL patterns that are able to classify Web pages. Our proposal is unsupervised, language and domain independent, and does not require extensive crawling. We have evaluated our proposal on 45 real-world web sites, and the results confirm that it can achieve a mean precision of 98% and a mean recall of 91%, and that its performance is comparable to that of a supervised classification technique, while it does not require labelling large sets of sample pages. Furthermore, we propose a novel application that helps to extract the underlying model from non-semantic web sites.
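A toy version of unsupervised URL pattern mining might generalise the path positions whose values vary widely across a site and then group URLs by the resulting pattern. The per-depth statistic and the threshold below are simplifying assumptions, not the paper's actual technique.

```python
# Sketch of unsupervised URL-based clustering: a path position is replaced by a
# wildcard when many distinct values occur there across the site, and URLs that
# share the resulting pattern are grouped into one class.
from collections import defaultdict
from urllib.parse import urlparse

def mine_patterns(urls, max_distinct=3):
    """Generalise path positions whose values vary a lot across the site."""
    paths = [[s for s in urlparse(u).path.split("/") if s] for u in urls]
    values_at = defaultdict(set)
    for segs in paths:
        for depth, seg in enumerate(segs):
            values_at[depth].add(seg)
    clusters = defaultdict(list)
    for url, segs in zip(urls, paths):
        pattern = "/" + "/".join(
            "*" if len(values_at[d]) > max_distinct else seg
            for d, seg in enumerate(segs)
        )
        clusters[pattern].append(url)
    return dict(clusters)

urls = [
    "http://site.example/product/1234",
    "http://site.example/product/5678",
    "http://site.example/product/9012",
    "http://site.example/product/3456",
    "http://site.example/help/contact",
]
print(mine_patterns(urls))
```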
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
Bonfring International Journal of Data Mining, 2016
Nowadays, the Web has become one of the most widespread platforms for information exchange and retrieval. As it becomes easier to publish documents, as the number of users, and thus publishers, increases, and as the number of documents grows, searching for information is turning into a cumbersome and time-consuming operation. Due to the heterogeneity and unstructured nature of the data available on the WWW, Web mining uses various data mining techniques to discover useful knowledge from Web hyperlinks, page content and usage logs. The main uses of web content mining are to gather, categorize, organize and provide the best possible information available on the Web to the user requesting the information. Mining tools are essential for scanning the many HTML documents, images, and texts; their results are then used by search engines. In this paper, we first introduce the concepts related to web mining; we then present an overview of different Web Content Mining tools. We conclude by presenting a comparative table of these tools based on some pertinent criteria.
Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. While automated web extraction has been studied extensively, existing methods often focus on extracting structured data that appear multiple times on a single webpage, like product catalogs. This project aims to extract less structured web content, like news articles, that appears only once in noisy webpages. Our approach classifies text blocks using a mixture of visual and language-independent features. In addition, a pipeline is devised to automatically label datapoints through clustering, where each cluster is scored based on its relevance to the webpage description extracted from the meta tags, and datapoints in the best cluster are selected as positive training examples.
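The automatic-labelling step can be sketched as follows: cluster the page's text blocks, score each cluster by its word overlap with the meta-tag description, and take the best cluster as positive training examples. The clustering method, the number of clusters, and the overlap score are assumptions for illustration only.

```python
# Sketch of clustering-based automatic labelling: blocks in the cluster most
# similar to the page's meta description become positive examples, the rest
# negative. Clustering and scoring choices are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def label_blocks(blocks, meta_description, n_clusters=3):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(blocks)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    meta_words = set(meta_description.lower().split())

    def cluster_score(c):
        words = set()
        for block, cl in zip(blocks, clusters):
            if cl == c:
                words.update(block.lower().split())
        return len(words & meta_words) / max(len(meta_words), 1)

    best = max(set(clusters), key=cluster_score)
    # Blocks in the best cluster become positive examples; the rest negative.
    return [(block, int(cl == best)) for block, cl in zip(blocks, clusters)]
```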
International Journal of Engineering Research and Technology (IJERT), 2012
https://www.ijert.org/web-content-mining-techniques-a-comprehensive-survey https://www.ijert.org/research/web-content-mining-techniques-a-comprehensive-survey-IJERTV1IS10269.pdf With the flooding of information on the WWW, it has become necessary to apply some strategy so that valuable knowledge can be extracted and consequently returned to the user. Data mining techniques find their applicability in this scenario. Data mining concepts and techniques applied to the WWW with its existing technologies are known as web mining. The paper contains techniques of web content mining, a review, various algorithms, examples and a comparison. Web mining is one of the well-known techniques in data mining and it can be done in three different ways: (a) web usage mining, (b) web structure mining and (c) web content mining. Web usage mining allows for the collection of web access information for web pages. Web content mining is the scanning and mining of the text, pictures and graphs of a web page to determine the relevance of the content to the search query. Web structure mining is used to identify the relationships between web pages linked by information. The paper presents various examples based on web content mining techniques in detail, with results and comparisons, to extract the necessary information effectively and efficiently.