Sampling

Sampling Step

Cursor-based

The cursor-based implementation of sampling is the Java application sampleRelation.java. Before running it, edit the source code so that it connects to your own SQL server with the appropriate username and password.
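As a rough illustration of the kind of change involved, the sketch below assumes the application opens its connection through JDBC using Microsoft's SQL Server driver; the actual code in sampleRelation.java may well differ. The host, port, database name, and credentials are placeholders, and the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical sketch; not taken from sampleRelation.java.
final class ConnectionSettingsSketch {

    // Placeholders (assumptions): replace with your own server and credentials.
    static final String HOST = "localhost";
    static final int PORT = 1433;
    static final String DATABASE = "mydatabase";
    static final String USER = "myuser";
    static final String PASSWORD = "mypassword";

    // Builds a connection URL in the format used by Microsoft's JDBC driver.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:sqlserver://" + host + ":" + port + ";databaseName=" + database;
    }

    // Opens the connection; requires a SQL Server JDBC driver on the classpath.
    static Connection open() throws SQLException {
        return DriverManager.getConnection(jdbcUrl(HOST, PORT, DATABASE), USER, PASSWORD);
    }

    public static void main(String[] args) {
        // Only prints the URL, so the sketch runs without a database.
        System.out.println(jdbcUrl(HOST, PORT, DATABASE));
    }
}
```

Keeping the settings in one place like this means only a few constants need to change when moving to a different server.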

"Deterministic"

The "deterministic" version of the sampling is executed using the SQL script sampleRelations.sql.

You can run the script as follows:

isql -U <user> -P <passwd> -d <database name> -i sampleRelations.sql

NOTE: This script creates the relations R1Sample and R2Sample with an extra column S (not described in the paper) that records the sample size used for each row. The script generates samples for S = 1, 2, 4, ..., 256. To use different sample sizes, change @UPPERLIMIT or the expression that sets @S.

-- Recreate the sample relations; the guard avoids an error on the first run
IF OBJECT_ID('dbo.R1Sample') IS NOT NULL
	DROP TABLE [dbo].[R1Sample]

CREATE TABLE [dbo].[R1Sample] (
	[tid] int NOT NULL,
	[token] varchar (80) NOT NULL,
	[c] int NOT NULL,
	[S] int NOT NULL,
	PRIMARY KEY (S,tid,token),
	FOREIGN KEY (tid) REFERENCES R1,
	FOREIGN KEY (token) REFERENCES R1IDF
)

IF OBJECT_ID('dbo.R2Sample') IS NOT NULL
	DROP TABLE [dbo].[R2Sample]

CREATE TABLE [dbo].[R2Sample] (
	[tid] int NOT NULL,
	[token] varchar (80) NOT NULL,
	[c] int NOT NULL,
	[S] int NOT NULL,
	PRIMARY KEY (S,tid,token),
	FOREIGN KEY (tid) REFERENCES R2,
	FOREIGN KEY (token) REFERENCES R2IDF
)

DECLARE @S int
DECLARE @I int
DECLARE @UPPERLIMIT int -- The largest sample size will be 2^@UPPERLIMIT
SET @I=0
SET @UPPERLIMIT=8 -- So we will create sample sizes S = 1, 2, 4, ... 256

WHILE @I <= @UPPERLIMIT
	BEGIN
	SET @S=POWER(2, @I)

	-- Keep a (tid, token) pair only if its rounded count c is positive
	INSERT INTO R1Sample(tid, token, c, S)
	SELECT rw.tid AS tid, rw.token AS token, ROUND( (rw.weight/rs.total) * @S, 0 ) AS c, @S AS S
	FROM R1Weights rw, R1Sum rs
	WHERE rw.token = rs.token AND ROUND( (rw.weight/rs.total) * @S, 0 )>0

	-- Same computation for the second relation
	INSERT INTO R2Sample(tid, token, c, S)
	SELECT rw.tid AS tid, rw.token AS token, ROUND( (rw.weight/rs.total) * @S, 0 ) AS c, @S AS S
	FROM R2Weights rw, R2Sum rs
	WHERE rw.token = rs.token AND ROUND( (rw.weight/rs.total) * @S, 0 )>0

	SET @I = @I+1
	END
GO
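
The arithmetic inside the loop can be illustrated outside the database. The sketch below is a minimal Java rendering of the expression ROUND((weight/total) * S, 0) for a single (tid, token) pair, assuming weight and total are positive floating-point values. The class name, method names, and the example weight are hypothetical and are not part of the original code. For positive values Math.round should agree with the SQL ROUND used in the script.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: mirrors the per-row arithmetic of the WHILE loop
// in sampleRelations.sql. All names here are hypothetical.
final class DeterministicSampleSketch {

    // Rounded count c for one (tid, token) pair at sample size S.
    static long count(double weight, double total, int sampleSize) {
        return Math.round((weight / total) * sampleSize);
    }

    // Counts for S = 2^0 .. 2^upperLimit, keeping only rows with c > 0,
    // which corresponds to the WHERE ... > 0 filter in the script.
    static Map<Integer, Long> countsPerSampleSize(double weight, double total, int upperLimit) {
        Map<Integer, Long> counts = new LinkedHashMap<>();
        for (int i = 0; i <= upperLimit; i++) {
            int s = 1 << i; // S = 2^i, as in SET @S=POWER(2, @I)
            long c = count(weight, total, s);
            if (c > 0) {
                counts.put(s, c);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical token with weight 0.3 out of a total of 2.0.
        System.out.println(countsPerSampleSize(0.3, 2.0, 8));
        // prints {4=1, 8=1, 16=2, 32=5, 64=10, 128=19, 256=38}
    }
}
```

With these example values the pair is dropped for S = 1 and S = 2, where the rounded count is 0, and is kept from S = 4 onward.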