Text Joins for Data Cleansing and Integration

Demo

Check the online demo.

SQL Scripts

The SQL scripts described in [2,3] cover the following steps:

  1. Import data
  2. Create tokens
  3. Create auxiliary relations
  4. Sampling
  5. Run the join
  6. Measure precision and recall

(The SQL scripts were tested on Microsoft SQL Server 2000, Developer Edition, Service Pack 2.)
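
To give a feel for what steps 2, 3, 5, and 6 compute, here is a minimal in-memory sketch in Python. It is not the SQL scripts themselves and should not be read as the exact method of [2,3]: the padding characters, the particular tf-idf formula, the default threshold, and all function names are assumptions made for illustration. It also skips step 4 (sampling) and computes the join exactly by comparing every pair.

```python
import math
from collections import Counter


def qgrams(text, q=3):
    """Step 2 (tokens): split a string into overlapping q-grams.

    The '#' and '$' padding characters are an illustrative assumption.
    """
    padded = "#" * (q - 1) + text.lower() + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def weight_vectors(strings, q=3):
    """Step 3 (auxiliary relations): one unit-length tf-idf vector per string.

    Uses tf * log(1 + N/df), a common tf-idf variant chosen for this sketch.
    """
    token_counts = [Counter(qgrams(s, q)) for s in strings]
    # Document frequency: in how many strings each token appears.
    df = Counter(tok for counts in token_counts for tok in counts)
    n = len(strings)
    vectors = []
    for counts in token_counts:
        raw = {t: tf * math.log(1 + n / df[t]) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def text_join(left, right, threshold=0.5, q=3):
    """Step 5 (join): index pairs whose cosine similarity meets the threshold.

    Exact all-pairs comparison; no sampling is performed here.
    """
    lv, rv = weight_vectors(left, q), weight_vectors(right, q)
    matches = set()
    for i, a in enumerate(lv):
        for j, b in enumerate(rv):
            # Vectors are unit length, so the dot product is the cosine.
            sim = sum(w * b[t] for t, w in a.items() if t in b)
            if sim >= threshold:
                matches.add((i, j))
    return matches


def precision_recall(found, truth):
    """Step 6: compare the joined pairs against a set of known true matches."""
    correct = len(found & truth)
    precision = correct / len(found) if found else 1.0
    recall = correct / len(truth) if truth else 1.0
    return precision, recall
```

Calling text_join on two lists of strings returns a set of (left index, right index) pairs, which precision_recall can then score against a hand-labelled set of true matches. The all-pairs loop is quadratic in the input size, so this sketch only suits small inputs.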

Papers

  1. Duplicate Record Detection: A Survey, (wiki; feel free to contribute)
    A. Elmagarmid, P. Ipeirotis, and V. Verykios,
    IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 19, no. 1, January 2007
  2. Text Joins in an RDBMS for Web Data Integration,
    L. Gravano, P. Ipeirotis, N. Koudas, and D. Srivastava,
    Proceedings of the 12th International World-Wide Web Conference (WWW2003), 2003
  3. Text Joins for Data Cleansing and Integration in an RDBMS, (poster)
    L. Gravano, P. Ipeirotis, N. Koudas, and D. Srivastava,
    Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE 2003), 2003
  4. Approximate String Joins in a Database (Almost) for Free, (erratum)
    L. Gravano, P. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava,
    Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), 2001
  5. Using q-grams in a DBMS for Approximate String Processing, (erratum)
    L. Gravano, P. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava,
    IEEE Data Engineering Bulletin, vol. 24, no. 4, December 2001