Thursday, 29 June 2017

MongoDB - Using Aggreagte and Text Search to compare and map records from two different datasets collections

Recently I had a scenario to compare two different data sets to find the matched records based on a text fields. The challenge is cant do it exact matching.

For Example
DataSet1 contains a record with field as
'Name : ABC Company'

DataSet2 may contains records as
Name : 'ABC Company Ltd.'

So decided to find the match based on ranking system.
Take every record in CUSTOMER1 and compare with CUSTOMER2 records and put score for the match. If the score is greater than 0.75 include CUSTOMER2 record to CUSTOMER1 record with score. Then get the highest match scored record.

Step 1.
Imported both data sets to MongoDB as two different collection, customer1 and customer2.
CUSTOMER1 data Sample
{
    "CUSTOMER1NAME": "ABC COMPANY",
    "CUSTOMER1ID": "CRM10003"
}


CUSTOMER2 data sample
{
  "CUSTOMER2NAME": "ABC COMPANY LTD.",
  "CUSTOMER2ID": "ORAXYZ100"
},
{
  "CUSTOMER2NAME": "XYZ Logistics",
  "CUSTOMER2ID": "ORAXYZ10333"
},
{
  "CUSTOMER2NAME": "ABC COMPANY (INDIA) LIMITED.",
  "CUSTOMER2ID": "ORAXYZ10022"
},
{
  "CUSTOMER2NAME": "EXCEL WORKS",
  "CUSTOMER2ID": "ORAXYZ10032"
}
 

Step 2.
Since we are going to use aggregate function with $text option, it is necessary to create TextIndex for CUSTOMER2 collection.

db.CUSTOMER2.createIndex(
                           { "CUSTOMER2NAME": "text" },
                           { name: "TextIndex" }
                         )
 
Step 3.
Below script will compare & score every record in CUSTOMER1 with CUSTOMER2 records using aggregate function.Score greater than 0.75 records are stored in CUSTOMER1 record item as match property.

  
 db.getCollection('CUSTOMER1').find({}).noCursorTimeout().forEach(function(item) {  
        var customername = item.CUSTOMER1NAME    
        var collection =    db.getCollection('CUSTOMER2').aggregate( [
        { $match: { $text: { $search: customername ,$diacriticSensitive: true  }} },
        { $project: { CUSTOMER2NAME:1, CUSTOMER2ID:1, _id: 0, score: { $meta: "textScore" } } },
        { $match: { score: { $gt: 0.75} } },
     
        ])
        if(collection._batch.length >0)
        {
            item.match = collection._batch;
        }
        db.CUSTOMERMAPPING.insert(item); 
    
})
Output of the above script will be like
  {
    "_id" : ObjectId("59540979a2109c38f5ca7112"),
    "CUSTOMER1NAME" : "ABC COMPANY",
    "CUSTOMER1ID" : "CRM10003",
    "match" : [ 
        {
            "CUSTOMER2NAME" : "ABC COMPANY (INDIA) LIMITED.",
            "CUSTOMER2ID" : "ORAXYZ10022",
            "score" : 1.25
        }, 
        {
            "CUSTOMER2NAME" : "ABC COMPANY LTD.",
            "CUSTOMER2ID" : "ORAXYZ100",
            "score" : 1.33333333333333
        }
    ]
}
 
Result will have all records from CUSTOMER2 with score property.

Step 4.
Now the step is take the maximum score from match records
 
db.getCollection('CUSTOMERMAPPING').aggregate([{
    $unwind: '$match'
}, {
    "$group": {
        "_id": "$CUSTOMER1ID",
        "CUSTOMER1NAME": {"$first": "$CUSTOMER1NAME"},
        "maxScore": {   
            "$max": "$match.score"
        },
        "matchgrp": {
            "$push": {
                "account_name": "$match.CUSTOMER2NAME",
                "acount_number": "$match.CUSTOMER2ID",
                "score": "$match.score"
            }
        }
    }
}
, {
    "$project": {
        _id: "$_id",
        CUSTOMER1NAME: "$CUSTOMER1NAME",
        topmatch: {
            "$setDifference": [{
                    "$map": {
                        "input": "$matchgrp",
                        "as": "matched",
                        "in": {
                            "$cond": [{
                                    "$eq": ["$maxScore", "$$matched.score"]
                                },
                                "$$matched",
                                false
                            ]
                        }
                    }
                },
                [false]
            ]
        }
 
    }
},{ $out : "HIGHSCOREMATCH" }],{allowDiskUse:true})  
 
$out will create result to new collection 'HIGHSCOREMATCH'
  {
    "_id" : "CRM10003",
    "CUSTOMER1NAME" : "ABC COMPANY",
    "topmatch" : [ 
        {
            "account_name" : "ABC COMPANY LTD.",
            "acount_number" : "ORAXYZ100",
            "score" : 1.33333333333333
        }
    ]
}
 
Most matched record from CUSTOMER2 will be mapped to CUSTOMER1 record.
Even the match will not be 100% accurate, it reduces our time for mapping manually.
Exported the final output in a excel and business user reviewed fixed the wrong matches. It reduces the time for us, instead of doing millions of records manually search and mapping.

Thanks