Overview and Framework for Data and Information Quality Research

Published: 01 June 2009

Abstract

Awareness of data and information quality issues has grown rapidly in light of the critical role played by the quality of information in our data-intensive, knowledge-based economy. Research in the past two decades has produced a large body of data quality knowledge and has expanded our ability to solve many data and information quality problems. In this article, we present an overview of the evolution and current landscape of data and information quality research. We introduce a framework to characterize the research along two dimensions: topics and methods. Representative papers are cited for purposes of illustrating the issues addressed and the methods used. We also identify and discuss challenges to be addressed in future research.

      Published In

      Journal of Data and Information Quality, Volume 1, Issue 1
      June 2009
      94 pages
      ISSN:1936-1955
      EISSN:1936-1963
      DOI:10.1145/1515693
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 June 2009
      Accepted: 01 March 2009
      Revised: 01 February 2009
      Received: 01 February 2008
      Published in JDIQ Volume 1, Issue 1

      Author Tags

      1. Data quality
      2. information quality
      3. research methods
      4. research topics

      Qualifiers

      • Research-article
      • Research
      • Refereed
