How to Find and Remove Duplicate Documents
Discovery Assistant is an eDiscovery application that identifies and removes duplicate documents from electronically stored information, including PST files, zip files, scanned documents, Microsoft Office documents, and PDF files.
When document sets are produced as part of an eDiscovery request, duplicate documents normally have to be identified and removed. Removing duplicates reduces the volume of documents produced, which in turn reduces the amount of data that needs to be reviewed and the overall cost of litigation.
Detection of Duplicate Documents:
As documents are loaded into Discovery Assistant, an MD5 hash value is calculated for each file. The MD5 hash value is a 128-bit binary signature calculated from the file contents. Duplicate candidates are identified by comparing MD5 hash strings. Documents that have identical MD5 hashes are then compared byte for byte to confirm that they match.
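The two-stage check described above (hash first, then a binary comparison) can be sketched in Python as follows. This is a minimal illustration using only the standard library, not Discovery Assistant's actual implementation; the function names `md5_of_file` and `find_duplicates` are hypothetical.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group files that share an MD5 hash AND are byte-for-byte identical.

    Hypothetical helper: returns a list of groups, each a list of Path
    objects with two or more confirmed duplicates.
    """
    # Stage 1: bucket files by MD5 hash.
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[md5_of_file(path)].append(Path(path))

    # Stage 2: confirm each bucket with a full binary comparison.
    groups = []
    for candidates in by_hash.values():
        confirmed = []
        for path in candidates:
            for group in confirmed:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                confirmed.append([path])
        groups.extend(group for group in confirmed if len(group) > 1)
    return groups
```

Calling `find_duplicates` with a list of file paths returns only the groups that contain at least two identical files. The first file in each group could be kept and the rest marked as duplicates.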
Deduplication of documents is usually done at the message level. A message is deemed to be a duplicate only if the message contents and the attachments are exactly the same. Deduplication can also be done at the file level to further reduce the data volume.
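As a rough sketch of message-level deduplication, the Python example below treats a message as a duplicate only when both its body and all of its attachments match. The `Message` class and the `message_key` and `dedupe_messages` functions are hypothetical names for illustration. Ignoring attachment order is an assumption made here, not something the product documentation states.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical message record: raw body bytes plus raw attachment bytes."""
    body: bytes
    attachments: list = field(default_factory=list)


def message_key(message):
    """Combine the body hash and attachment hashes into one message-level key."""
    body_hash = hashlib.md5(message.body).hexdigest()
    # Assumption: attachment order does not matter, so the hashes are sorted.
    attachment_hashes = sorted(
        hashlib.md5(data).hexdigest() for data in message.attachments
    )
    combined = "|".join([body_hash, *attachment_hashes])
    return hashlib.md5(combined.encode("ascii")).hexdigest()


def dedupe_messages(messages):
    """Keep the first message seen for each key and drop later duplicates."""
    seen = set()
    unique = []
    for message in messages:
        key = message_key(message)
        if key not in seen:
            seen.add(key)
            unique.append(message)
    return unique
```

Under this scheme, two messages with the same body but different attachments produce different keys, so both are kept. File-level deduplication would instead hash each attachment on its own.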
To remove duplicates from a document set:
Identifying Duplicate Documents Across Multiple Projects:
The Discovery Assistant Terabite application supports grouping multiple Discovery Assistant projects into one large document set in order to identify global duplicates. A global duplicate count is also created at this point.
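The idea of global duplicates can be illustrated with a short Python sketch that merges the hash lists of several projects. The input layout (project name mapped to document name mapped to MD5 hex string) and the function name `global_duplicates` are assumptions for illustration. The count is defined here as the number of copies beyond the first for each hash, which may differ from how the product reports it. The sketch compares hashes only and skips the binary confirmation step.

```python
from collections import defaultdict


def global_duplicates(projects):
    """Find documents whose MD5 hash appears more than once across projects.

    projects: {project_name: {document_name: md5_hex}}  (hypothetical layout)
    Returns (duplicates, count), where duplicates maps each repeated hash to
    its list of (project_name, document_name) locations.
    """
    locations = defaultdict(list)
    for project_name, documents in projects.items():
        for document_name, md5_hex in documents.items():
            locations[md5_hex].append((project_name, document_name))

    duplicates = {
        md5_hex: found for md5_hex, found in locations.items() if len(found) > 1
    }
    # Assumption: every copy after the first counts as one global duplicate.
    count = sum(len(found) - 1 for found in duplicates.values())
    return duplicates, count
```

Because the hashes are grouped across all projects at once, a document that is unique within each project can still be flagged when a copy exists in another project.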
Other ways to reduce eDiscovery data volume:
Discovery Assistant supports a number of other tools to help reduce data volume and identify key documents.
To download and try out Discovery Assistant:
To test the deduplication capabilities of Discovery Assistant, download and install a demo copy of the program from: http://www.discoveryassistant.com/Download/Downloads.asp
Then load a test set of documents and deduplicate it.
To contact Discovery Assistant:
“About a year ago, I was offered some software with the purchase of a copy machine but the eDiscovery component didn’t deduplicate and it was too expensive. I mean, way out there! I’m bullish on Discovery Assistant, though, because it’s priced right, easy to use and allows me to print less while doing so much more electronically.”