Principles of Data Integration

Data integration is the problem of answering queries that span multiple data sources (e.g., databases, web pages). Data integration problems surface in multiple contexts, including enterprise information integration, query processing on the Web, coordination between government agencies, and collaboration between scientists. In some cases, data integration is the key bottleneck to making progress in a field. For example, when two companies merge, the number of different databases scattered across the combined company could easily reach 100. Obtaining a complete and organized view of the data requires data integration technology. In particular, semantic integration involves resolving the inevitable differences in how the respective schemas define certain concepts, such as "earnings" or "compliant." This book presents a comprehensive treatment of the issues faced in integrating data from multiple sources, from the theoretical principles to system issues and the current challenges raised by the World Wide Web and cloud computing. It helps readers answer the constantly recurring question: How do I approach answering queries when my data is stored in multiple databases that were designed independently by different people?