Abstract
We present a simple method to extract information from search engine snippets. Although the techniques presented are domain independent, this work focuses on extracting biographical information about historical persons from multiple unstructured sources on the Web. We first find a list of persons and their periods of life by querying a search engine for the periods and scanning the retrieved snippets for person names. Subsequently, we find biographical information for the persons extracted. To gain insight into the mutual relations among the persons identified, we create a social network using co-occurrences on the Web. Although we use uncontrolled and unstructured Web sources, the information extracted is reliable. Moreover, we show that Web Information Extraction can be used to create applications that are both informative and enjoyable.
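The two steps named in the abstract (scanning snippets for person names with a period of life, then relating persons by co-occurrence) can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expression `PERSON_PERIOD`, the function names `extract_persons` and `co_occurrences`, and the example snippets are all hypothetical, and the snippets are supplied as plain strings rather than fetched from a search engine.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical pattern (an assumption, not taken from the paper):
# two or more capitalized words followed by "(birth-death)".
PERSON_PERIOD = re.compile(
    r"((?:[A-Z][a-z]+\s)+[A-Z][a-z]+)\s\((\d{4})\s?[-–]\s?(\d{4})\)"
)


def extract_persons(snippets):
    """Return {name: (birth, death)} for each 'Name (yyyy-yyyy)' match."""
    persons = {}
    for snippet in snippets:
        for name, birth, death in PERSON_PERIOD.findall(snippet):
            # Keep the first period seen for a name.
            persons.setdefault(name, (int(birth), int(death)))
    return persons


def co_occurrences(snippets, names):
    """Count, per unordered pair of names, the snippets mentioning both."""
    counts = Counter()
    for snippet in snippets:
        present = sorted(n for n in names if n in snippet)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


# Made-up example snippets standing in for search engine results.
snippets = [
    "Wolfgang Amadeus Mozart (1756-1791) was a friend of Joseph Haydn (1732-1809).",
    "Joseph Haydn and Wolfgang Amadeus Mozart played quartets together.",
    "Joseph Haydn (1732-1809) was born in Rohrau.",
]

persons = extract_persons(snippets)
network = co_occurrences(snippets, persons)
```

Here `network` maps each pair of names to the number of snippets in which both appear, which can serve as an edge weight in a social network. The simple pattern misses names containing lowercase particles (such as "van") and may absorb a preceding capitalized word, so a real system would need more careful name recognition.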
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Geleijnse, G., Korst, J. (2007). Creating a Dead Poets Society: Extracting a Social Network of Historical Persons from the Web. In: Aberer, K., et al. The Semantic Web. ISWC 2007, ASWC 2007. Lecture Notes in Computer Science, vol 4825. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76298-0_12
DOI: https://doi.org/10.1007/978-3-540-76298-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76297-3
Online ISBN: 978-3-540-76298-0
eBook Packages: Computer Science, Computer Science (R0)