Recently I had a scenario to compare two different data sets to find the matched records based on a text fields. The challenge is cant do it exact matching.
For Example
DataSet1 contains a record with field as
'Name : ABC Company'
DataSet2 may contains records as
Name : 'ABC Company Ltd.'
So decided to find the match based on ranking system.
Take every record in CUSTOMER1 and compare with CUSTOMER2 records and put score for the match. If the score is greater than 0.75 include CUSTOMER2 record to CUSTOMER1 record with score. Then get the highest match scored record.
Step 1.
Imported both data sets to MongoDB as two different collection, customer1 and customer2.
Step 2.
Since we are going to use aggregate function with $text option, it is necessary to create TextIndex for CUSTOMER2 collection.
Below script will compare & score every record in CUSTOMER1 with CUSTOMER2 records using aggregate function.Score greater than 0.75 records are stored in CUSTOMER1 record item as match property.
Step 4.
Now the step is take the maximum score from match records
For Example
DataSet1 contains a record with field as
'Name : ABC Company'
DataSet2 may contains records as
Name : 'ABC Company Ltd.'
So decided to find the match based on ranking system.
Take every record in CUSTOMER1 and compare with CUSTOMER2 records and put score for the match. If the score is greater than 0.75 include CUSTOMER2 record to CUSTOMER1 record with score. Then get the highest match scored record.
Step 1.
Imported both data sets to MongoDB as two different collection, customer1 and customer2.
CUSTOMER1 data Sample { "CUSTOMER1NAME": "ABC COMPANY", "CUSTOMER1ID": "CRM10003" } CUSTOMER2 data sample { "CUSTOMER2NAME": "ABC COMPANY LTD.", "CUSTOMER2ID": "ORAXYZ100" }, { "CUSTOMER2NAME": "XYZ Logistics", "CUSTOMER2ID": "ORAXYZ10333" }, { "CUSTOMER2NAME": "ABC COMPANY (INDIA) LIMITED.", "CUSTOMER2ID": "ORAXYZ10022" }, { "CUSTOMER2NAME": "EXCEL WORKS", "CUSTOMER2ID": "ORAXYZ10032" }
Step 2.
Since we are going to use aggregate function with $text option, it is necessary to create TextIndex for CUSTOMER2 collection.
Step 3.db.CUSTOMER2.createIndex( { "CUSTOMER2NAME": "text" }, { name: "TextIndex" } )
Below script will compare & score every record in CUSTOMER1 with CUSTOMER2 records using aggregate function.Score greater than 0.75 records are stored in CUSTOMER1 record item as match property.
Output of the above script will be likedb.getCollection('CUSTOMER1').find({}).noCursorTimeout().forEach(function(item) { var customername = item.CUSTOMER1NAME var collection = db.getCollection('CUSTOMER2').aggregate( [ { $match: { $text: { $search: customername ,$diacriticSensitive: true }} }, { $project: { CUSTOMER2NAME:1, CUSTOMER2ID:1, _id: 0, score: { $meta: "textScore" } } }, { $match: { score: { $gt: 0.75} } }, ]) if(collection._batch.length >0) { item.match = collection._batch; } db.CUSTOMERMAPPING.insert(item); })
Result will have all records from CUSTOMER2 with score property.{ "_id" : ObjectId("59540979a2109c38f5ca7112"), "CUSTOMER1NAME" : "ABC COMPANY", "CUSTOMER1ID" : "CRM10003", "match" : [ { "CUSTOMER2NAME" : "ABC COMPANY (INDIA) LIMITED.", "CUSTOMER2ID" : "ORAXYZ10022", "score" : 1.25 }, { "CUSTOMER2NAME" : "ABC COMPANY LTD.", "CUSTOMER2ID" : "ORAXYZ100", "score" : 1.33333333333333 } ] }
Step 4.
Now the step is take the maximum score from match records
$out will create result to new collection 'HIGHSCOREMATCH'db.getCollection('CUSTOMERMAPPING').aggregate([{ $unwind: '$match' }, { "$group": { "_id": "$CUSTOMER1ID", "CUSTOMER1NAME": {"$first": "$CUSTOMER1NAME"}, "maxScore": { "$max": "$match.score" }, "matchgrp": { "$push": { "account_name": "$match.CUSTOMER2NAME", "acount_number": "$match.CUSTOMER2ID", "score": "$match.score" } } } } , { "$project": { _id: "$_id", CUSTOMER1NAME: "$CUSTOMER1NAME", topmatch: { "$setDifference": [{ "$map": { "input": "$matchgrp", "as": "matched", "in": { "$cond": [{ "$eq": ["$maxScore", "$$matched.score"] }, "$$matched", false ] } } }, [false] ] } } },{ $out : "HIGHSCOREMATCH" }],{allowDiskUse:true})
Most matched record from CUSTOMER2 will be mapped to CUSTOMER1 record.{ "_id" : "CRM10003", "CUSTOMER1NAME" : "ABC COMPANY", "topmatch" : [ { "account_name" : "ABC COMPANY LTD.", "acount_number" : "ORAXYZ100", "score" : 1.33333333333333 } ] }
Even the match will not be 100% accurate, it reduces our time for mapping manually.
Exported the final output in a excel and business user reviewed fixed the wrong matches. It reduces the time for us, instead of doing millions of records manually search and mapping.
Thanks