Text Joins for Data Cleansing and Integration
SQL Scripts
The SQL scripts described in [2,3] cover the following steps:
- Import data
- Create tokens
- Create auxiliary relations
- Sampling
- Run the join
- Measure precision and recall
(The SQL scripts were tested on Microsoft SQL Server 2000,
Developer's edition, Service Pack 2.)
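To make the steps above concrete, the following is a minimal sketch and not the scripts themselves. It assumes q-grams as tokens, a simple tf-idf weighting (log(1 + n/df)), and SQLite driven from Python instead of SQL Server 2000. The import and sampling steps are omitted. The names qgrams, load_weights, text_join, precision_recall, r1_weights and r2_weights are all hypothetical.

```python
import math
import sqlite3
from collections import Counter


def qgrams(text, q=3):
    """Create tokens: overlapping q-grams, padded so string edges count."""
    padded = "#" * (q - 1) + text.lower() + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def load_weights(conn, table, rows, q=3):
    """Create an auxiliary relation table(tid, token, weight).

    Weights are tf-idf values normalised to unit length per tuple, so the
    sum of weight products over shared tokens is a cosine similarity.
    The idf formula used here is an assumption of this sketch.
    """
    conn.execute(f"CREATE TABLE {table} (tid INTEGER, token TEXT, weight REAL)")
    tokenized = {tid: Counter(qgrams(text, q)) for tid, text in rows}
    df = Counter(tok for counts in tokenized.values() for tok in counts)
    n = len(rows)
    for tid, counts in tokenized.items():
        raw = {tok: tf * math.log(1 + n / df[tok]) for tok, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?, ?)",
            [(tid, tok, w / norm) for tok, w in raw.items()],
        )


# Run the join: pair up tuples whose summed weight products reach the threshold.
JOIN_SQL = """
SELECT r1.tid, r2.tid, SUM(r1.weight * r2.weight) AS sim
FROM r1_weights AS r1 JOIN r2_weights AS r2 ON r1.token = r2.token
GROUP BY r1.tid, r2.tid
HAVING SUM(r1.weight * r2.weight) >= ?
"""


def text_join(r1_rows, r2_rows, threshold, q=3):
    """Return the set of (tid1, tid2) pairs with similarity >= threshold."""
    conn = sqlite3.connect(":memory:")
    load_weights(conn, "r1_weights", r1_rows, q)
    load_weights(conn, "r2_weights", r2_rows, q)
    pairs = {(a, b) for a, b, _sim in conn.execute(JOIN_SQL, (threshold,))}
    conn.close()
    return pairs


def precision_recall(found, truth):
    """Measure precision and recall of found pairs against known matches."""
    hits = len(found & truth)
    precision = hits / len(found) if found else 1.0
    recall = hits / len(truth) if truth else 1.0
    return precision, recall


if __name__ == "__main__":
    # Made-up illustration data, not taken from the papers.
    r1 = [(1, "Microsoft Corporation"), (2, "Columbia University")]
    r2 = [(1, "Microsoft Corp."), (2, "University of Toronto")]
    found = text_join(r1, r2, threshold=0.3)
    print(found, precision_recall(found, {(1, 1)}))
```

Keeping the similarity computation inside a single SQL query mirrors the idea of running the join in the database itself. The threshold and the value of q are tuning choices for this sketch, not values taken from the scripts.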
[1] Duplicate Record Detection: A Survey (wiki; feel free to contribute),
A. Elmagarmid, P. Ipeirotis, and V. Verykios,
IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 19, no. 1, January 2007.
[2] Text Joins in an RDBMS for Web Data Integration,
L. Gravano, P. Ipeirotis, N. Koudas, and D. Srivastava,
Proceedings of the 12th International World-Wide Web Conference (WWW2003), 2003.
[3] Text Joins for Data Cleansing and Integration in an RDBMS (poster),
L. Gravano, P. Ipeirotis, N. Koudas, and D. Srivastava,
Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE 2003), 2003.
[4] Approximate String Joins in a Database (Almost) for Free (erratum),
L. Gravano, P. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava,
Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), 2001.
[5] Using q-grams in a DBMS for Approximate String Processing (erratum),
L. Gravano, P. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava,
IEEE Data Engineering Bulletin, vol. 24, no. 4, December 2001.