e-Discovery Software
Q&A - EDD Processing Software Q: Can Discovery Assistant Produce native load files without tiffing the documents? A: Queue for conversion 'metadata only'. That will allow you to produce source documents without TIFFing. Most metadata except for extracted text is available (note: text from MSG files is still extracted). Q: How long will it take to process a GB? A: One GB takes approximately 20 hours of processing time, and produces 70,000 pages. (1 second a page conversion time, and 1 second a page for import/export/deblank/bates numbering, etc). Q: Can you handle email from Eudora and Thunderbird? A: We can handle native MIME email (similar to how Outlook Express is stored) and we think that is how Eudora and Thunderbird messages are stored. If some tweaking is required, we're willing to do the work. Q: Customer needs Multi-page TIF, metadata and bibliographical data in text delimited format with field headers [associate with doc id number as well]. All parent IDs and attachment IDs must be captured. A: If you export, and choose Character Separated Text, then under Options, select all - That gives you the full set of metadata fields that we extract and export, including Parent ID and Attachment ID. We can extract limited bibliographical data from Office, Excel, PDF, and email. Q: How would I set output to Eastern Standard Time, non-military format? A: We read timestamps as UTC, then convert to local time. Set the machine you are working on to Eastern Standard Time, non-military format, that is how we will produce the date/times. Q: How do you handle corrupt or password protected files? Placeholder file? Exception report but do not process? Placeholder File; set aside and notify us. A: Current solution is to fail on first attempted conversion/import. Failed items can be 'copied', unencrypted/unlocked, then re-imported. OR you can do pass-through (placeholder) for these items. Q: How do I export in chronological order, earliest to latest date? 
A: When exporting, users normally ask for parent/child sort order. You can also sort and export based on date/time. If you are planning to sort on date, it is best to first set the date format on the machine to YYYY-MM-DD. All formatted dates will then be sortable. Q: We also need a blowback set for our customer's office. A: Do you mean hard copies? You can choose to print converted TIFF files (either stamped or unstamped) from Discovery Assistant. Q: How do we export to CD? A: You can export VOL\BOX, limit the output size to CD, then burn a CD from that exported data. Q: When viewing image and text side by side, when I scroll the TIFF, shouldn't the text scroll as well? A: It's an interesting idea, and we will add it to our list of things to look at. The basic issue is that we don't know where the TXT is relative to the image. All we know is that this is what printed to get that TIFF file. We could still probably scroll the text as you scroll the image though, and hope that we get it right. Q: Can you print the "All Files" screen so that you have a hard copy list of file ID, Bates, name, status, hash, etc., or can you export that to a CSV file? A: You can 'copy' to the clipboard, and paste into an Excel spreadsheet. The other alternative is to use our DAReportManager reporting tool, which can load the XML project file and convert it to XLS. Q: How do you actually select a file from the list? I highlight it, and then click a command such as Convert, and it says no files are selected, do you wish to select all files. A: The Convert button has an 'arrow' beside it that shows a pop-up. The choices are: All, Selected. Q: What is the MTF file used for? Is it simply a TXT document that the attorney can have to see all the metadata fields in an easy-to-read format? A: The MTF file stores the metadata before we do an export. Our thinking was that when we put in the search feature, we can search the MTF file without first having to do an export.
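On the date-sorting answer above: the reason YYYY-MM-DD helps is that text compared character by character then falls in chronological order, which is not true of MM/DD/YYYY. A minimal Python sketch, using made-up sample dates and a hypothetical helper `to_iso` (neither comes from Discovery Assistant):

```python
from datetime import datetime

# Made-up sample dates in MM/DD/YYYY form, as they might appear in an export column.
us_style = ["12/01/2004", "01/15/2005", "03/02/2004"]


def to_iso(us_date: str) -> str:
    """Convert MM/DD/YYYY text to YYYY-MM-DD text (hypothetical helper)."""
    return datetime.strptime(us_date, "%m/%d/%Y").strftime("%Y-%m-%d")


iso_style = [to_iso(d) for d in us_style]

# A plain text sort of MM/DD/YYYY orders by month first, not by time.
print(sorted(us_style))

# A plain text sort of YYYY-MM-DD is also a chronological sort.
print(sorted(iso_style))
```

The first list comes out with the 2005 date ahead of both 2004 dates, while the second runs earliest to latest.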
Q: Is there a way to see a dedup report of files that were duplicates? I think there is, but I didn't see it. A: If you sort on the 'duplicated' key, that gives you a list of all files for which there is a duplicate. If you run the project file through Global Deduplication, that will identify the global primary file and the global count. On any particular duplicate, you can list its parent, its siblings, or its duplicates by pressing one of the action buttons in the button bar. Q: Can you tell me what filtering technology you use (Native/MAPI, Verity, Stellent, or Open Source) for the following: A: We use the following methods to extract files and metadata:
File type is determined using binary data to get an extension type. Extension type is then translated through the operating system to get a file type. At the moment, we do not include an integrated search. However, we do support extraction of source, tiff, text, and metadata as separate files. Search tools can comb through these files looking for matches.
Conversions are done using the PrintTo interface. Custom converters exist for Word, Excel, PowerPoint, and internal Office / Notes file formats. Q: Do you know what the limits are on the number of files you can process at a time? I have 3,295 files. I can't get it to load more than a couple hundred at a time. Thanks. A: You should be able to load up to 50,000+ files in a single project. We suspect the problem is that you are using the 'Add Files' dialog. To load in a directory of files, use 'Add Folder'. Optionally, you can open Explorer, and 'drag' the files across. Finally, you can construct a list of files, and use 'Add From List'. Q: When I go to convert, there are no output file format types (list is blank) in the Convert dialog. A: It looks very much like an incorrect installation. The quick fix is to email us the file: Q: My lawyer wants all messages and the attachments to be exported in one TIFF file. A: We can't quite do that, but here is something that comes close: To itemize what files belong to what message files:
Q: I'm having a problem processing MSG files. The Outlook dialog comes up, gets dismissed, but the document does not process. A: Possible fixes: First, temporarily turn our 'clickyes' program OFF. You can do this by renaming the registry item HKLM\Software\ImageMAKER\DA_ClickButton (add an extra character to the name). You should now be able to manually close the dialog (and allow for 10 minutes of activity). If this change works, then we'll look further into what the problem might be. There is an alternative 'close' tool called 'ClickYesSetup' that you can install from the Start / Programs / ImageMAKER Discovery Assistant group. If that is installed, and set to active, and it solves the Outlook dialog problem, that would be another solution. Also, if you have any other third party software that is set up to close these Outlook dialogs, that too could explain the problem - two applications are trying to close the same dialog. Q: How do I exclude Outlook attachments from being processed? A: Under the Options / Scan tab, set 'exclude outlook attachments'. When you add in a MSG file, we now ignore all attachments. Q: How do I control the formatting of the MSG file to look like it was printed from Outlook? A: Default behavior is to extract HTML / RTF / TEXT, then use the native application to print that file. To get 'Outlook' formatted output, from the Options / Outlook tab, change rendering from 'default' to 'Outlook MSG'. Q: Can we generate a report or export to a .csv / spreadsheet of all files that couldn't be converted, non-convertible or failed, that contains the file name, path, etc.? A: If you download DAReportManager, that will give you the ability to create a spreadsheet containing all the information from each of the tabs (contact ImageMAKER for download instructions). Q: How do I handle foreign character sets (like Cyrillic)? A: As far as we know, Summation and Concordance do not support Unicode. 
For this reason, metadata and extracted text are exported as MBCS (multi-byte character strings), which can be handled by Concordance/Summation. The TIFF files print using the native application. If the native application supports foreign character sets, then those characters are properly represented in the TIFF file. To set up for MBCS output:
Adding Arabic language support: if the user did not add Arabic language support during the installation wizard, it can be added after installation completes by the following steps:
You should be able to review the extracted TXT, and metadata using Notepad. Text data is MBCS encoded. The following MBCS character sets (code pages) are supported:
Q: I'm wondering if you can give me some additional insight into why there are so many blank pages in this spreadsheet. If you look at it in print preview in Excel, go to page 19 and you will find 19-29 blank. Also 49-60, 79-94, and I quit looking at that point. As far as I can tell, there is no data. Is there something going on with formatting, maybe? A: In spreadsheets, users usually set the 'print range' of active cells to print - and that range usually contains data. However, the print range can miss huge swaths of the spreadsheet - and it's the swaths that we want in the discovery process. Discovery Assistant defaults to printing the entire sheet, blank pages and all, which is defined to be the largest box that contains the top left cell and bottom right cell. Discovery Assistant spreadsheet formatting settings are controlled through settings in the Admin / Configure / Excel Options page. The following settings are supported:
Q: Quick question on the hashes that DiscoveryAssistant produces. A client of mine is insisting that the MD5 hash does not have any dashes in it, though the hash values DiscoveryAssistant produces do. If the dashes are removed via search and replace, would it still represent the correct hash value? A: Yes, it's still valid. We put the dashes in so the number is human readable. One other point - you can customize how much of the file to hash. We then binary compare if we get any matches. If you are doing global deduplication using the hash value, make sure the Options / De-duping / HashCode sample size is set to 0. Q: The Word document I am converting contains 2 pages, but when I process it I only see one page. A: If you see only one page in the TIFF file, and that page contains everything that the original doc contains, then I have a very good explanation... When switching print drivers, Word documents reformat slightly. You can see this when changing the default print driver in Word, with a Word document open. The text position will change slightly. The same file printed to an HP printer may produce two pages, and when printing to a PostScript printer come out as only one page. To get exact equivalence between the two printers, we need:
Q: What is 'child next' order when assigning Bates Numbers? Also, do you ever assign the same Bates number to more than one document? A: We load the files and attachments in a slightly different order from how case management systems request them. We load in the first layer of attachments, then go back looking for attachments to the attachments. We understand that the 'proper' way of importing/exporting is to list the first attachment, and its children, before listing the second attachment. For lack of a better term, we use the name 'child next' order. To get the right export order, we recommend assigning Bates Numbers in child next order, then sorting afterwards on Bates number. Then export. You'll note that we also fill in the Bates range for each message at this point (allowing you to confirm that everything is listed in the right order). As for handling duplicates, you have 3 choices at time of conversion:
If you ignore (or copy) duplicates, then each file gets its own Bates number. If you skip duplicates, then only the first 'converted' file gets assigned a Bates number. If you link duplicates, then multiple 'duplicates' will all be assigned the same Bates number. Q: What process is used to identify whether a file is readable? Is a listing such as the NSRL used? Does it utilize the extension of the file or does it extract the file to determine what it is (i.e., if an Excel spreadsheet has a .AAA extension instead of .XLS extension, will it still recognize it as an Excel spreadsheet and treat it accordingly)? A: Process by which we recognize the file type is:
Q: There is a process that a lot of law firms are doing and I was wondering if you had a way to accomplish this. In order to save money the law firms will request that we only deliver the metadata and OCR, no images. Then, they tag the responsive docs they want and ask us to only TIFF those. Any ideas? A: When converting, you can choose:
If you are dealing with just email, the email text is extracted as part of the MetaData. That might be good enough for your needs (and is fast). The only time you would want MetaData and text is if you want the text contents of attachments, and the text contents of loose documents. Once you've done the conversion, you can export that data to Concordance/Summation/IPRO/CSV for analysis. The trick is to remember to extract the FileID as one of the export fields. The user imports the data into Concordance/Summation/Case Management System, analyzes it, then exports back out a list of documents that they want TIFFed. That list of items needs to include FileID as one of the record items. Using a text editor, or Excel, you should be able to remove any extra data, creating a TXT file with just one FileID per line. Next, re-open the Discovery Assistant project, create a copy of the project (save as), then remove all converted/queued items. Then, from the Menu, select: Project / Queue from FileID list. Open the FileID list you just created, and only those items with FileIDs specified will be queued for conversion. Convert, Bates stamp, export, and you're done. One advantage of this method is that the client needs to use your services twice, once to get the metadata with FileID, and the second time to get the TIFF files. It would be difficult for the client to identify what documents to process without getting you to first do the extraction. If the client were to give you loose files, they run the risk of losing the file relationship information - and potentially changing the data as items are extracted from the original source documents. Q: What caliber of person is required to competently operate the DA software? A: The product needs to be set up and run by an IT knowledgeable person (2+ years experience). Most problems that we see are 'setup' problems - getting the product up and running in a production environment. Startup issues we see are:
Once the product is up and running, relatively junior people with little or no IT experience can run it. If a problem comes up, then the senior IT person needs to look at the problem, then call us if it can't be solved locally. Q: What training programs do you offer? A: A person with IT experience can get the product up and running without any additional training on our part. That IT person in turn needs to train the user, and be available to answer any user questions. User training by anyone other than the IT person is difficult, as it is the IT person who is going to have to answer the first line questions, and who needs to do the re-installs and first-line trouble shooting. Q: What database architecture are you guys using, SQL? Access? And do you have any sort of distributed processing for large jobs that I may want to spread over multiple machines? A: We use an XML file loaded into memory to get the fastest possible database processing speed. Maximum practical size of XML file is 500,000 files. To handle data sets larger than half a million files, we provide you with an add-on tool we call Terabite that builds an MDB file containing millions of files, then export this list to multiple 'Discovery Assistant load files' by breaking the list down by bytes (2 Gigs), or number of files (100,000) - both numbers are user configurable. To set up multiple jobs, break the data into smaller projects, (possibly using our Terabite program to automate the process), then load and process multiple projects on multiple machines. As long as the source files are referenced with a \\Server\share UNC name, it doesn't matter what machine is used to do the processing. In addition to the Terabite tool, we also provide a tool to map XML project files back to XLS or MDB to facilitate reporting. Q: What are the limitations as to the maximum number of files to be processed in a batch? A: Our current recommendation is to keep the number of files per project batch to less than 500,000 files. 
When processing the batch, to optimize speed we load the file list into memory. As the list size grows, the time to load/update/save the list gets increasingly longer. 500,000 files seems to be a practical limitation. There are no physical limits on files sizes, or number of pages per file. Recognizing that different data sets have different file densities, the rule of thumb we use is each gigabyte of data translates to 70,000 converted pages. At an approximate conversion speed of 1 page per second, a single copy of Discovery Assistant should be able to process 3,600 pages an hour, or a gigabyte of data every 20 hours. Our recommendation is to keep the lists sizes to 100,000 files, or a 2 gigabyte maximum. That keeps job processing at less than 24 hours per job. Q: How do you process word files that have mark-ups with in the doc file? A: The default conversion process converts the document similar to how it was saved. If markups are displayed when the document was last saved, then the markups are printed. There is a difference in how the markups print based on whether you are using Office 2000, or Office 2003. Office 2003 prints much more information about markup changes than Office 2000. Q: How do you handle password protected files? A: If the file is password protected, our current default behavior is to time out waiting for the application to print. We then kill the application. The default timeout value is 30 seconds. If there are a lot of password protected files, then conversion is going to go very slowly. Failed files can be 'moved' to another directory, and then set up for password cracking. Our understanding is that cracking a password can take multiple hours per file, and not something to try in real time. At some point in the future, we'll look at trying to determine if a file is password protected before attempting the conversion. 
Note: there are a number of 3rd party applications designed to handle password detection and cracking for: Excel, Access, Word, RAR, PDF, Outlook.
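The batch sizing rule of thumb given above (70,000 pages per gigabyte, about 1 page per second, and lists capped at 100,000 files or 2 gigabytes) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, "2 gigabytes" is taken to mean 2 x 1024^3 bytes, and the greedy in-order split is not a description of how the Terabite tool actually works.

```python
from typing import Iterable, List, Tuple

PAGES_PER_GB = 70_000        # rule of thumb quoted in the answer above
PAGES_PER_SECOND = 1.0       # approximate conversion speed quoted above
MAX_FILES = 100_000          # recommended list size
MAX_BYTES = 2 * 1024 ** 3    # assumption: "2 gigabytes" read as 2 GiB


def estimated_hours(gigabytes: float) -> float:
    """Estimate conversion hours for a data set of the given size."""
    return gigabytes * PAGES_PER_GB / PAGES_PER_SECOND / 3600


def split_batches(
    files: Iterable[Tuple[str, int]],
    max_files: int = MAX_FILES,
    max_bytes: int = MAX_BYTES,
) -> List[List[Tuple[str, int]]]:
    """Split (path, size_in_bytes) pairs, in order, into capped batches.

    A new batch starts when adding the next file would exceed either cap.
    A single file larger than max_bytes still gets a batch of its own.
    """
    batches: List[List[Tuple[str, int]]] = []
    current: List[Tuple[str, int]] = []
    current_bytes = 0
    for path, size in files:
        too_many = len(current) >= max_files
        too_big = current_bytes + size > max_bytes
        if current and (too_many or too_big):
            batches.append(current)
            current, current_bytes = [], 0
        current.append((path, size))
        current_bytes += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(f"1 GB is roughly {estimated_hours(1):.1f} hours")
    sample = [("a.pst", 1_600_000_000), ("b.pst", 1_000_000_000), ("c.doc", 200_000_000)]
    for number, batch in enumerate(split_batches(sample), start=1):
        print(number, [path for path, _ in batch])
```

By the same rule of thumb a full 2 gigabyte batch works out to roughly 39 hours, so batches near the byte cap may run longer than the 24-hour target.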
Q: What are the benefits of paying Maintenance and Support? A: Having paid up maintenance ensures that you have continued access to developer support, and that we can help you with any problems that come up. We understand that your business is to convert documents to TIFF for your client, and that if you have problems, you need them fixed as quickly as possible. If you have specialized requirements, our developers are also available to do custom development. Q: How do I add support for GIF files? A: On Windows XP and Windows 2003, the Windows Picture and Fax Viewer can do the job. To set the default, go into explorer, do a search for GIF, then open a GIF. At that point, the file association will be set. Can then do a re-check from Discovery Assistant, and the GIF files will be convertible. Same process for JPEG. On a Windows 2000 machine, run the Imaging For Windows application, and set the menu item: Tools / General Options - open images in Imaging. Q: How do I change from using UltraEdit back to making Notepad the default TXT viewer? A: Ultra edit file associations can be difficult to over-rule, especially if you are looking to use Ultra Edit for other purposes. Fix is to go into Discovery Assistant / Admin / Configure / Documents Next, put the following command into the 'override cmd' edit box at the bottom of the dialog: Note: Make sure that Notepad File / PageSetup has the header/footers removed.... Q: What is the best setup? workstations writing to a central SQL DB? Workstation with discovery assistant and SQL local? A: Ideal setup is as follows:
Discovery Assistant native database is XML (flat file). We load the whole file into memory, and all database operations are fast. At one point we modeled using an SQL database from within Discovery Assistant. However, the speed of access was slow. Made it difficult to 'assign' large groups of files to different status values, etc. Q: What are the best practices for planning a multi-batched project? A: We've been developing the tools to manage terabytes of data. Believe best practices work as follows:
We've found that by de-coupling the process from an SQL database, and using XML load files managed entirely in memory, that our ability to access data, compare files for duplicates, and manage project queues of 50,000+ individual files - is significantly sped up over traditional database access times. Q: What is the procedure for merging batches and renumbering? A: Quick answer is that you need to first convert before assigning bates numbers. Am suggesting that if batches are numbered, and delivered in that order, the process of bates numbering can occur concurrent with the delivery of petrified data. Multiple batches can be converted at the same time. Batch 2 can't be bates numbered until Batch 1 conversion is complete. In the event that files do have to be re-done due to a customer request, there are a number of options that allow you to set bates numbering to what ever start number is decided upon. Q: What QC functions can be run on processed data while another batch is being processed (on the same machine) if any? or does the current batch need all local resource to be as fast as possible? A: XML files are used to define the project. If you want to QC a project as it is being converted, then you have to do it in the active conversion session. If you want to QC a project that has already been converted, then this can be done on a second machine. User can open the project file, then view the converted data. You can install as many 'QC' versions of Discovery Assistant as you like, for no extra charge. The QC versions do not convert, but can 're-queue' files to be converted. Q: Keyword Searching? A: Not something we currently offer. Best to export the document MetaData to Concordance or Summation, and then work from there. As an alternative, it is possible to complete the conversion, then use Google Desktop to index the resulting files. Q: What are the settings for TimeOuts in Discovery Assistant? A: Default values are (in seconds):
If you want to make the timeout infinite (for really large multi-page files):
If you are running really large files, we find that the spooler sometimes runs out of room. If so, you need to change the print driver Advanced Settings (Start / Settings / Printers / Properties / Advanced tab) from "spool print documents" to "Print directly to the printer". Q: When Discovery Assistant screens date macros from putting in today's dates in Word and Excel documents, what date if any does in fact appear in the petrified image? Is the date macro removed entirely and no date appears? Does the date of last modified appear? A: We switch the machine date to the date of the image being rendered (the date it was last saved). The macro still runs, and the date gets filled in. It's difficult to disable ALL macros. It's also misleading to not put in a date. When we output Word or Excel documents we set UpdateFields=false (using WordPrintTo.exe, ExcelPrintTo.exe) so date fields don't get changed from their last value. This operates independently of the feature that temporarily changes the system date (enacted through the admin interface). Usually changing the system date is only required if the conversion process produces headers/footers with the date/time and these are required to reflect the last saved date/time. Headers/footers can usually be turned off for most documents by opening the parent application (Word, Excel, etc.) with a blank document, selecting PageSetup from the File menu, and removing the header/footer. NOTE: for this feature to work, you must turn ON the Date Handler in Discovery Assistant. To do this, go to the Admin dialog, select Configure, and under the Options tab, select: 'Reset System Time to LastWrite Time before conversion.' Warning: with this option turned on, do not use this machine during conversion for other business functions. Q: My PST file contains 27,000 emails, each of which contains a signature file. In total there are about 10 different signature files. 
When I go to export to Summation or Concordance, for each signature file reference, the list of duplicates is enormous. The DII file itself is larger than 80 Gigabytes. What can I do? A: Go back to the AllFiles tab. You should be able to locate one or more of these common signature files. Identify the FileID and Hash value, then sort on hash value. Once you've re-located the signature file in question, delete all the copies of this file from the data set (OK to leave one). You may want to make a note of the number of copies you've deleted. Do this for any of the other signature files that are a problem. Then, go back and try re-exporting. The Data file should now be much smaller. Q: Recently, my system has been crashing more often, due to a process or program "CiceroUIWndFrame". A: I found out that this is the "Speech and Handwriting Recognition" part of Office XP. To de-install it, go to :
Q: If I have a job already stamped, but don't want to print the whole thing, is there a way I can select which pages to print? What happened was the computer and/or printer caused the print job to stop midway through (I assume the software won't cause this).
Q: Any information you have on performance, e.g. number of files converted per minute? A: In regard to speed, there are no hard numbers. We tested 7 WORD files (simple graphics, lots of text) with the following page counts: 3, 71, 3, 5, 16, 3, 204 3.2 GHZ machine, no hyper-threading. 1 Gig of memory and big hard drive.
At 300 dpi, pages per minute ranges from 70 - 84. For smaller file sizes (one page per file), they have been seen to go as low as 30 pages per minute. At 200 dpi, pages per minute ranges from 130 - 150 per minute. For smaller file sizes (one page per file), conversion speed can go as low as 30 pages per minute. For files containing a large number of pages, conversion speeds are somewhere above 200 pages per minute. Conversion speed can be increased dramatically (doubled) if you switch to Windows Fast Dithering as the dithering option (Printing Properties / General Tab / Printing Preferences / Advanced / Image Rendering Options - color mode). Basic trend:
The Windows Fast Dither uses a reduced memory area for conversion, and 'dithers' the text and graphics to B&W as they are being written to the surface. The Error Diffusion Dithering method dithers the whole image when it is being written to file (and can take up significantly more memory). Default for the ODC Carrier is to set 'Windows Fast Dither' to on. Q: One problem I have run into is that DA has difficulty with and often get an error message when I try to convert a .pdf file that is larger than 500 pages. What can I do? A: The PDF problem I believe is related to us running out of spool file room. The PDF application when printing is putting data into the spooler much quicker than we can pull the data out. Quick fix is to go into the printer properties / Advanced tab for the ImageMAKER XDC Service1, and set the spooler properties to 'print directly to printer'. This means that as soon as data goes into the spooler queue we take it back out. Acrobat is going to remain open much longer. Only problems we are aware of with always having this setting on is that some Word documents with landscape/portrait pages don't print correctly. Also, some documents will print slower. Q: How does your software handle error reporting, if at all? A: When you import files to be converted, the product first identifies those files that can be converted from those that can't. Then, during the conversion process if there are any failures, the failed files are listed in the 'failure' tab. Converted files are listed in the 'converted' tab. Files can be 're-converted' by moving them back into the 'queued' tab. We've tested file lists of up to 250,000 files with no problems. If you need to convert larger numbers of files, then we suggest breaking them down into sets of 100,000. Q: Does the software deduplicate emails? If so, based on what criteria? A: We've built in automatic duplication removal. 
We create check-sums for each item (email, attachment, or file) that we then compare to all other items in the list. Before confirming a match, we do a complete byte compare between the two files. For emails, we check the 'text' contents against each other, rather than the whole msg file. (every MSG file is unique). The user then has the option of converting the duplicates (or not), and exporting the duplicates (or not). Q: How does the software handle Excel spreadsheets? Does it remove blank pages automatically? A: We've developed some specialized Excel spreadsheet software to do the following:
Q: What do I do to ensure there is white space for the bates stamp? I want to ensure that the Bates Stamp does not obscure the underlying data. A: The solution is to do the conversion into a smaller area than what the original image size is. The print driver has a setting that can modify the page margins. Default is 0 margins (output TIFF image size is same as input TIFF image size). Go to Settings / Printers. Select properties for ImageMaker XDC Service1. Under the Device Settings tab, look for unprintable regions. Set the unprintable regions so that you have room for the bates stamp. Suggest the following settings:
Q: What happens if one of the servers or clients crashes in the middle of a conversion? How does DA recover? How does the user manage events like this? A: All conversions are controlled by the client, but can be handled on either a client or server. In the event that the server machine dies, the client will time out, then go on to try the next item in the list. In the event that the client machine dies, the user can restart that machine, and it will pick up where it left off. All conversion status information is maintained in an XML file. This file is updated after every conversion.
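The recovery behavior described above (status kept in an XML file that is rewritten after every conversion, so a restarted client picks up where it left off) can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the `project`/`item` element names, the `id` and `status` attributes, and the function names are hypothetical and are not Discovery Assistant's actual project file format.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable


def load_status(path: str) -> Dict[str, str]:
    """Return {file_id: status} from the status file, or {} if it is absent."""
    if not os.path.exists(path):
        return {}
    root = ET.parse(path).getroot()
    return {item.get("id"): item.get("status") for item in root.iter("item")}


def save_status(path: str, status: Dict[str, str]) -> None:
    """Rewrite the whole status file, swapping it in with an atomic rename."""
    root = ET.Element("project")  # hypothetical layout, not the real schema
    for file_id, state in status.items():
        ET.SubElement(root, "item", id=file_id, status=state)
    folder = os.path.dirname(os.path.abspath(path))
    handle, temp_path = tempfile.mkstemp(dir=folder)
    os.close(handle)
    ET.ElementTree(root).write(temp_path, encoding="utf-8", xml_declaration=True)
    os.replace(temp_path, path)


def run_queue(
    path: str, file_ids: Iterable[str], convert: Callable[[str], None]
) -> Dict[str, str]:
    """Convert each queued item once, saving status after every item."""
    status = load_status(path)
    for file_id in file_ids:
        if file_id in status:
            continue  # already converted or failed before the restart
        try:
            convert(file_id)
            status[file_id] = "converted"
        except Exception:
            status[file_id] = "failed"
        save_status(path, status)
    return status
```

Calling `run_queue` again with the same status file and file list after an interruption skips every item that already has a recorded status and continues with the first item that does not.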