Consider the following three items, where w1, w2, and so on each represent a distinct word. Compare how well the shingle process determines which items are near duplicates when the shingles are composed of 3 words versus 6 words. Use the rolling definition of shingles: for 3-word shingles, words 1-3 form shingle 1, words 2-4 form shingle 2, words 3-5 form shingle 3, and so on until the last 3 words form the last shingle. The 6-word shingles are built the same way with a window of 6 words. To determine the numeric value of a shingle, concatenate the word numbers it contains. Thus shingle w1w1w4 has the numeric value 114, and shingle w1w1w4w2w2w1 has the value 114221. Use Broder's formula, in which the resemblance of two items is the number of shingles they have in common divided by the number of distinct shingles in either item, to calculate the resemblance between each pair of items for the 3-word shingles and for the 6-word shingles. Discuss the results and the impact of going to 6-word shingles.
Item 1: w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3
Item 2: w1 w4 w2 w4 w1 w1 w4 w2 w2 w1 w2 w3 w3 w2 w2 w4
Item 3: w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w5 w4 w3
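
The procedure described above can be sketched in Python as a way to check hand calculations. This is a minimal sketch, not a reference solution: the names ITEMS, shingles, and resemblance are made up for illustration, and it assumes the resemblance formula is the usual set ratio |A ∩ B| / |A ∪ B| over the distinct shingle values of each item, so a shingle that repeats within an item is counted once.

```python
from fractions import Fraction
from itertools import combinations

# The three items from the problem statement, split into words.
ITEMS = {
    1: "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3".split(),
    2: "w1 w4 w2 w4 w1 w1 w4 w2 w2 w1 w2 w3 w3 w2 w2 w4".split(),
    3: "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w5 w4 w3".split(),
}


def shingles(words, k):
    """Return the set of numeric values of all rolling k-word shingles.

    The value of a shingle is its word numbers concatenated, e.g.
    w1 w1 w4 -> 114. This is only unambiguous while every word number
    is a single digit, which holds for the items above.
    """
    digits = [w[1:] for w in words]  # "w4" -> "4"
    return {int("".join(digits[i:i + k])) for i in range(len(digits) - k + 1)}


def resemblance(a, b):
    """Assumed resemblance: shared shingles over all distinct shingles."""
    return Fraction(len(a & b), len(a | b))


for k in (3, 6):
    sets = {n: shingles(ws, k) for n, ws in ITEMS.items()}
    for i, j in combinations(sorted(sets), 2):
        r = resemblance(sets[i], sets[j])
        print(f"k={k}  items {i} and {j}: {r} = {float(r):.3f}")
```

Under these assumptions, the 3-word shingles give resemblances of 4/19 for items 1 and 2, 8/13 for items 1 and 3, and 1/5 for items 2 and 3. The 6-word shingles give 1/21, 4/7, and 1/21 for the same pairs. Both sizes single out items 1 and 3 as the near duplicates, but the longer shingles push the unrelated pairs much closer to zero, so the gap between duplicates and non-duplicates widens. The cost is that the single changed word in item 3 now falls inside more of the shingles that cover it, so the score for the near-duplicate pair drops slightly as well.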