Entity resolution

One of the main hurdles in creating the CJARS database is the lack of a linking ID variable that identifies the same individual across different stages of the criminal justice system. For example, an individual sentenced to prison will often appear in police records when arrested, in court records when the case is adjudicated, and in corrections records when incarcerated. Data from these three stages may come from different providers and share no linkage variables. To solve this problem, CJARS developed a probabilistic matching algorithm that generates a roster file and a CJARS ID for every unique individual in our data. The roster file allows us to track individuals across the criminal justice system using the algorithm-generated CJARS ID.

The algorithm was constructed using personally identifying information (PII) data provided by a large county court system and a large prison system in the United States. These data include biometric identification numbers as well as variation in PII within a given biometric ID, allowing us to simulate the naturally occurring variation in PII across different data providers. We use a sample of these data to train a model that is then used to predict the match status of all records in the data. Because the biometric IDs identify which records truly belong to the same person, we measure the out-of-sample performance of our algorithm by comparing the algorithm-defined IDs with the biometric IDs. The table below shows that, in record pair comparisons, the CJARS algorithm achieves precision above 99% and recall of roughly 90% or higher in both jurisdictions.
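The comparison of algorithm-defined IDs with biometric IDs can be illustrated with a minimal Python sketch. It is not the CJARS implementation: the function name `pairwise_confusion` and the toy ID lists are hypothetical, and the sketch simply treats two records as a predicted match when they share an algorithm ID and as a true match when they share a biometric ID.

```python
from itertools import combinations


def pairwise_confusion(algorithm_ids, biometric_ids):
    """Count record-pair outcomes, treating biometric IDs as ground truth."""
    if len(algorithm_ids) != len(biometric_ids):
        raise ValueError("ID lists must have the same length")
    tp = fp = tn = fn = 0
    # Visit every unordered pair of records exactly once.
    for i, j in combinations(range(len(algorithm_ids)), 2):
        predicted_match = algorithm_ids[i] == algorithm_ids[j]
        true_match = biometric_ids[i] == biometric_ids[j]
        if predicted_match and true_match:
            tp += 1  # correctly linked pair
        elif predicted_match:
            fp += 1  # linked, but different individuals
        elif true_match:
            fn += 1  # same individual, but not linked
        else:
            tn += 1  # correctly left unlinked
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical toy data: five records belonging to three true individuals.
biometric_ids = ["B1", "B1", "B1", "B2", "B3"]
algorithm_ids = ["A1", "A1", "A2", "A3", "A3"]
counts = pairwise_confusion(algorithm_ids, biometric_ids)
print(counts)  # {'TP': 1, 'FP': 1, 'TN': 6, 'FN': 2}
```

Enumerating every pair is quadratic in the number of records, so this form is only suitable for small illustrations. With millions of records, the same counts would presumably be derived from group sizes rather than from an explicit loop over pairs.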

                                                             County court records      State prison records
                                                             (N=4.2m)                  (N=2.5m)
Performance measure  Definition                              Record pair  Entity       Record pair  Entity
                                                             comparison   space        comparison   space
Accuracy             (TP+TN) / (TP+FP+TN+FN)                 1.000        1.000        1.000        1.000
Precision            TP / (TP+FP)                            0.992        0.985        0.991        0.993
Recall               TP / (TP+FN)                            0.900        0.940        0.958        0.986
F-measure            2 x ([Prec. x Rec.] / [Prec. + Rec.])   0.944        0.967        0.974        0.986
False Positive Rate  FP / (FP+TN)                            0.000        0.000        0.000        0.000

TP = true positives, FP = false positives, TN = true negatives, FN = false negatives.
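
The performance measures can be computed from confusion counts as in the short Python sketch below. The function name `match_metrics` and the counts are hypothetical and are not CJARS results, and the false positive rate uses the standard definition FP / (FP+TN). The example is meant to show why accuracy rounds to 1.000 and the false positive rate rounds to 0.000 whenever true negatives vastly outnumber every other outcome, as they do when most record pairs are non-matches. This is why precision and recall are the more informative measures.

```python
def match_metrics(tp, fp, tn, fn):
    """Compute the performance measures from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * (precision * recall) / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical counts in which non-matching pairs (TN) dominate.
metrics = match_metrics(tp=900, fp=10, tn=1_000_000, fn=100)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 1.000
# precision: 0.989
# recall: 0.900
# f_measure: 0.942
# false_positive_rate: 0.000
```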