FAQs

Digital Forensics (2)

WHAT IS DE-DUPLICATION?

De-duplication is one of the most effective and common ways in which to reduce the volume of documents for review. De-duplication is also one of the most confusing areas of e-discovery where technology firms and lawyers often misunderstand each other. The starting point is to understand how duplicates are identified so as to understand what constitutes a ‘duplicate’.

The easiest and most common form of de-duplication is by MD5 or ‘hash’ value. MD5 stands for Message Digest algorithm 5, which has become the industry standard algorithm for this purpose. It calculates a 128-bit value that is normally written as a 32 digit hexadecimal number (i.e. 32 characters where each character is one of 16 possible characters: 0-9 and a, b, c, d, e, f).

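The MD5 value is effectively a ‘fingerprint’ for an electronic file. It is calculated from the content of the file, including any metadata stored inside the file, but not from file system details such as the file name or the dates recorded by the operating system. If two documents have the same MD5 value they can, for practical purposes, be treated as identical.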
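As a rough illustration of the fingerprint idea, the following Python sketch uses the standard library's hashlib module to calculate MD5 values. The sample text and the helper name md5_hex are made up for this example and are not taken from any particular e-discovery tool.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


# Illustrative file contents only.
original = b"Quarterly report, final version"
exact_copy = b"Quarterly report, final version"
edited = b"Quarterly report, final version."  # one extra character

print(md5_hex(original))                          # a 32 character fingerprint
print(md5_hex(original) == md5_hex(exact_copy))   # True: identical content
print(md5_hex(original) == md5_hex(edited))       # False: content differs
```

Even a one character change produces a completely different value, which is why only exact duplicates are caught by this method.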

For emails the approach adopted is slightly different in that different metadata fields may be selected to calculate what is known as a cryptographic hash value (otherwise known as a ‘hash’ value). For example, the Nuix software creates MD5 hash values for emails based on the to, from, cc, subject and body text without reference to spaces and attachment data.
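The general idea of hashing selected email fields can be sketched in Python as follows. This is an assumption-laden illustration, not the Nuix algorithm: the function name email_hash, the field separator, the sorting of recipients and the exact whitespace handling are all choices made for this example.

```python
import hashlib
import re


def email_hash(sender, to, cc, subject, body):
    """Hypothetical email fingerprint: MD5 over selected fields, ignoring whitespace.

    Attachments are deliberately not part of the calculation.
    """
    # Sorting recipients is an assumption so that recipient order does not matter.
    parts = [sender, ";".join(sorted(to)), ";".join(sorted(cc)), subject, body]
    # Remove all whitespace from each field before hashing.
    normalised = [re.sub(r"\s+", "", part) for part in parts]
    # Join with a separator character that will not occur in the fields.
    return hashlib.md5("\x1f".join(normalised).encode("utf-8")).hexdigest()


first = email_hash("a@example.com", ["b@example.com"], [], "Budget update", "Please see attached.\n")
respaced = email_hash("a@example.com", ["b@example.com"], [], "Budget  update", "Please  see attached.")
other = email_hash("a@example.com", ["b@example.com"], [], "Budget revised", "Please see attached.")

print(first == respaced)  # True: only the spacing differs
print(first == other)     # False: the subject differs
```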

Having calculated MD5 values for all electronic documents it then becomes a matter of comparing the values across the collection of data to identify (and remove or hide) the duplicates.

However, consideration needs to be given as to the approach to de-duplication owing to the issues of context and document ‘families’. The issue of document ‘families’ is most important in relation to emails and their attachments. For instance, it is common for identical files such as a Word document or Excel spreadsheet to be attached to non-identical emails. This may occur for instance where an email is sent attaching a spreadsheet file which in turn is forwarded on to another person in a new email without having altered the original Excel file.

Normally lawyers will want to review emails and attachments together even to the extent that such attachments may be duplicated in the document collection. For this reason the most common approach to de-duplication is to de-duplicate across emails at the ‘top level’ thereby leaving duplicated attachments in the database.
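A minimal Python sketch of ‘top level’ de-duplication is set out below. The data layout (a list of families, each with a parent email hash and a list of attachment identifiers) and the function name are hypothetical and exist only to illustrate the principle.

```python
def deduplicate_top_level(families):
    """Keep the first family seen for each parent email hash.

    families: list of dicts with 'parent', 'parent_md5' and 'attachments' keys.
    Returns every document id in the kept families, attachments included,
    so attachments are never removed on their own.
    """
    seen_parent_hashes = set()
    kept_documents = []
    for family in families:
        if family["parent_md5"] in seen_parent_hashes:
            continue  # duplicate email: the whole family is dropped
        seen_parent_hashes.add(family["parent_md5"])
        kept_documents.append(family["parent"])
        kept_documents.extend(family["attachments"])
    return kept_documents


# Illustrative data: the same spreadsheet is attached to three emails.
families = [
    {"parent": "email-1", "parent_md5": "aaa", "attachments": ["budget.xlsx#1"]},
    {"parent": "email-1-copy", "parent_md5": "aaa", "attachments": ["budget.xlsx#2"]},
    {"parent": "email-2-forward", "parent_md5": "bbb", "attachments": ["budget.xlsx#3"]},
]
print(deduplicate_top_level(families))
```

In this example the exact copy of the first email is removed together with its attachment, while the forwarded email keeps its copy of the spreadsheet even though that spreadsheet is a duplicate.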

The second consideration is whether to de-duplicate across custodians or only within custodians. In a typical scenario, we might have the email boxes and ‘my documents’ folders from a number of people who are from the same firm and who were working on the same project that is the subject of litigation. It is normal for these people to have been on the same distribution list and to have therefore received the same emails. When their documents are collected there will be a high degree of duplication.

The question of whether to de-duplicate across all custodians or only within each custodian may depend upon the importance of knowing who had what documents / emails in their possession. It is worth noting that in relation to emails the metadata (sender, recipients) will generally provide this information.
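The difference between the two approaches can be sketched in Python. The record layout (custodian, document id, MD5 value) and the function name are assumptions made for this illustration.

```python
def deduplicate(records, across_custodians):
    """Return the document ids kept after de-duplication.

    records: list of (custodian, doc_id, md5) tuples.
    across_custodians: True compares hashes over the whole collection,
    False compares hashes only within each custodian's own documents.
    """
    seen = set()
    kept = []
    for custodian, doc_id, md5 in records:
        key = md5 if across_custodians else (custodian, md5)
        if key not in seen:
            seen.add(key)
            kept.append(doc_id)
    return kept


# Illustrative data: both custodians hold the same email (hash "h1").
records = [
    ("alice", "A1", "h1"),
    ("alice", "A2", "h1"),
    ("bob", "B1", "h1"),
    ("bob", "B2", "h2"),
]
print(deduplicate(records, across_custodians=True))   # ['A1', 'B2']
print(deduplicate(records, across_custodians=False))  # ['A1', 'B1', 'B2']
```

De-duplicating across custodians removes more documents, but the record that Bob also held a copy of the email then has to be preserved in some other way.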

Reflecting on the nature of electronic documents collected and the nature of the way in which custodians were likely to have been communicating with each other will provide some guidance as to the likely level of duplication within a collection of electronic documents. For instance, if various copies or backups of the same custodian’s documents are collected, the level of duplication may be quite high. If the same is done for a number of custodians who worked closely together on the same projects, the level of duplication across the custodians will also normally be quite high.

By ‘customised de-duplication’ we refer to a process of de-duplication using narrower criteria than that normally used. For example, we have encountered instances where use of Blackberry devices or certain email archiving systems has inserted additional, irrelevant lines of text, such as a confidentiality statement, which cause otherwise identical emails not to be identified as duplicates. Millnet’s programmers are able to write bespoke programming code to accommodate such circumstances and ensure a more effective de-duplication as a result, thereby saving costs.
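One possible way of handling inserted text is to strip known boilerplate lines before hashing, as in the Python sketch below. This is not Millnet's code: the list of boilerplate phrases and the function name are hypothetical, and a real matter would need its own list.

```python
import hashlib

# Hypothetical phrases, written in lower case, that mark lines added by a device or archive.
BOILERPLATE_PREFIXES = (
    "sent from my blackberry",
    "this email is confidential",
)


def normalised_body_hash(body):
    """MD5 of the email body after dropping blank lines and known boilerplate lines."""
    kept_lines = []
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.lower().startswith(BOILERPLATE_PREFIXES):
            continue
        kept_lines.append(stripped)
    return hashlib.md5("\n".join(kept_lines).encode("utf-8")).hexdigest()


plain = "Please call me.\nRegards, Sam"
with_footer = plain + "\n\nSent from my BlackBerry wireless device"
different = "Please email me.\nRegards, Sam"

print(normalised_body_hash(plain) == normalised_body_hash(with_footer))  # True
print(normalised_body_hash(plain) == normalised_body_hash(different))    # False
```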

Note that de-duplication is not the same concept as near de-duplication or email threads / chains. Refer elsewhere in this glossary for further explanation of these concepts.

WHAT IS A FORENSIC IMAGE?

The process of creating an exact duplicate from a source of electronic data (most commonly a hard drive but also other storage media such as backup tapes, disks, usb keys, flash drives etc). ‘Exact’ in a forensic sense involves creating a copy at the ‘binary’ level (i.e. the 0’s and 1’s that are the building blocks of computer data) and includes files that may be deleted, hidden or otherwise stored on the source media in such a way that they are not visible without the use of specialist forensic software tools.

The resulting forensic image does not resemble the source data as it will be contained within a ‘container’ file which in turn needs to be ‘opened’ using specialist software. Note also that to the extent that the original source contained hidden or deleted data that was not visible to the custodian (using normal file browsing software such as Windows Explorer), this will be available for analysis, search and extraction by a forensic consultant using specialist software tools. The requirement to search for hidden or deleted data goes beyond normal e-discovery processing and gives rise to additional, potentially significant, forensic consultancy charges.

The question of when to adopt a forensically sound approach to collection of electronic documents (i.e. normally to engage third party experts) is a judgement call to be made by reference to the application of Part 31 CPR.

e-Discovery (13)

WHAT IS BATCHING?

Batching is the term used to describe the process of gathering documents together into “batches”, typically for the purpose of allocating documents to reviewers for categorisation (also referred to as tagging). Historically, batches of documents would normally be allocated on the basis of chronological order, with attachments following their emails. The downside of this approach is that, given the prevalence of email chains and near-duplicates, reviewers find themselves reading the same or largely similar documents multiple times, which makes for a less efficient and therefore higher cost review workflow.

WHAT MATERIAL IS DISCLOSABLE IN CIVIL LITIGATION?

In civil proceedings (as opposed to criminal proceedings and regulatory enquiries) the old approach, which applied before the Civil Procedure Rules came into force in 1999, involved a general obligation to provide disclosure of any document “if it may fairly lead [the party] to a train of enquiry which may either … directly or indirectly … advance the party’s own case or damage the case of the adversary”.

This is known as the “Peruvian Guano” test (the name of one of the parties in the case where the test was expressed) and dates back to 1882.

This is no longer good law, in the sense that the extent of the obligation to disclose material is now determined by the Courts on a case by case basis. The Court has a wide discretion when making an Order for Disclosure, and the terms of the Order will vary depending on the case.

Generally the Order will be for “standard disclosure” to be given by the parties and standard disclosure involves a much more restrictive interpretation of what should be disclosed.

The following extract, from the judgment in the case of Nichia Corporation v Argos Limited (2007), elegantly summarises the significant restriction on scope of disclosure which now exists when an order for standard disclosure is made:

“There is more to be said about the change to standard disclosure and indeed to the express introduction of proportionality into the rules of procedure. “Perfect justice” in one sense involves a tribunal examining every conceivable aspect of a dispute. All relevant witness and all relevant documents need to be considered. And each party must be given a full opportunity of considering everything and challenging anything it wishes. No stone, however small, should remain unturned. Even the adversarial system at its most expensive in this country has not gone that far. For instance we do not include the evidence of a potentially material witness if neither side calls him or her. Nor do we allow pre-trial oral disclosure from all potential witnesses as is (or at least was) commonly the practice in the US.

But a system which sought such “perfect justice” in every case would actually defeat justice. The cost and time involved would make it impossible to decide all but the most vastly funded cases. The cost of nearly every case would be greater than what it is about. Life is too short to investigate everything in that way. So a compromise is made: one makes do with a lesser procedure even though it may result in the justice being rougher. Putting it another way, better justice is achieved by risking a little bit of injustice.

The “standard disclosure” and associated “reasonable search” rules provide examples of this. It is possible for a highly material document to exist which would be outside “standard disclosure” but within the Peruvian Guano test. Or such a document might be one which would not be found by a reasonable search. No doubt such cases are rare. But the rules now sacrifice the “perfect justice” solution for the more pragmatic “standard disclosure” and “reasonable search” rules, even though in the rare instance the “right” result may not be achieved. In the vast majority of instances it will be, and more cheaply so.”

There will be instances where an Order other than “standard disclosure” is made. The options are formally set out in Rule 31.5(4), which will come into force in April 2013 and which essentially lists the sort of options that a Court already has. These include making:

an order dispensing with disclosure;
an order that a party disclose the documents on which it relies, and at the same time request any specific disclosure it requires from any other party;
an order that directs, where practicable, the disclosure to be given by each party on an issue by issue basis;
an order that each party disclose any documents which it is reasonable to suppose may contain information which enables that party to advance its own case or to damage that of any other party, or which leads to an enquiry which has either of those consequences;
an order that a party give standard disclosure;
any other order in relation to disclosure that the court considers appropriate.

Sometimes (although by no means always) it is necessary to collect, process and review on the basis that everything potentially relevant is included – with a view to determining following review what needs to be disclosed – but, equally, there are many cases where it is possible to narrow the scope of the entire exercise at an early point.

WHAT IS BLOOMBERG DATA?

Bloomberg provides access to news, and analysis of what is happening in selected markets.

Integrated into that information service is a suite of communication tools, whereby licensed users can send instant messages and emails (with text and attachments in both) to fellow Bloomberg users.

Typically the communication system is used outside of an ordinary email environment as users are able to communicate prices and trade information with the added ability to then extract all pricing information from their messages in a spreadsheet or other analysis format.

The most common data format which is the subject of a collection is Bloomberg Instant Message or corporate email. However, mobile SMS, Facebook and LinkedIn communications can also be exported.

WHAT IS CODING?

The process of entering fields of information from a document and saving these in a format that will be associated with the particular document, typically within a database. The process most often refers to coding of scanned paper documents and is typically a manual process, although there are automated coding technologies which can be useful if the documents to be coded are relatively homogeneous, such as standard forms (this is very rare in relation to documents for commercial litigation matters). Common coded fields include document type, date, time, author, recipients, subject or title and attachment(s).

WHAT IS A CONCEPTUAL SEARCH?

Refers to the latest searching technology which goes beyond keyword and phrase searching to also find documents that are related by reference to concepts. For example, say you are looking for references to banking transactions using keywords such as bank* (which as a wildcard search will return hits on bank, banks, banking etc) and transaction*. Conceptual search may also return hits on terms that are conceptually similar such as ‘deposit’, ‘funds’, ‘account’, ‘transfer’ etc, to the extent that the algorithms used by the software to run the search have understood such terms to be similar in concept by reference to the initial search terms and the documents being searched.

Another form of advanced search, referred to as ‘relevance’, is based not on concepts but on a statistical analysis of the words appearing in documents. It has been shown to be another effective means of extending searching beyond keywords and phrases.
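For comparison with conceptual search, the behaviour of a plain keyword search such as bank* can be approximated in Python with a regular expression. This sketch shows only the keyword baseline, not conceptual search itself, and the function name and sample sentence are invented for the example.

```python
import re


def wildcard_to_regex(term):
    """Convert a keyword, optionally ending in *, into a case-insensitive word pattern."""
    if term.endswith("*"):
        # bank* matches any word that starts with "bank".
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    # Without the wildcard only the exact word is matched.
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


text = "The bank confirmed two banking transactions near the embankment."
print(wildcard_to_regex("bank*").findall(text))         # ['bank', 'banking']
print(wildcard_to_regex("transaction*").findall(text))  # ['transactions']
```

Words such as ‘deposit’ or ‘transfer’ would never be returned by this kind of search, which is the gap that conceptual search is intended to fill.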

WHAT ARE CONTAINER FILES?

A container file is an electronic file that contains other files. A common example is ZIP files which are often used to ‘contain’ multiple files such as emails, Word documents or Excel spreadsheets.

One reason for using a container file is that the ‘container’ file is normally considerably smaller in size than the sum of the files contained within. The reason for this is that when the files are added to the ‘container’ they are also ‘compressed’ (i.e. made smaller) in size.

Emails are often contained within a container file. For instance Microsoft Outlook emails are contained within what is known as a PST (Personal STorage) file (for a single mailbox) or an EDB (Exchange DataBase) file (the central store of multiple mailboxes).

E-discovery processing extracts the contents of container files so that the individual files can be easily viewed. The extracted contents of container files will normally be somewhere between 50% and 250% larger in size than the original container file. This is one of the main reasons why it is difficult to predict how many gigabytes of data will ultimately be hosted (which in turn is generally the main determinant of ongoing monthly costs).
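The effect of extraction on data volume can be demonstrated with a small Python sketch using the standard zipfile module. The sample file is artificially repetitive, so the expansion figure it produces is far higher than would be seen on real data; the function names are made up for this example.

```python
import io
import zipfile


def build_container(files):
    """Create an in-memory ZIP container from a dict of file name -> bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def expansion_percent(container):
    """How much larger the extracted files are than the container, as a percentage."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        extracted_size = sum(info.file_size for info in archive.infolist())
    return 100.0 * (extracted_size - len(container)) / len(container)


# Illustrative content only: repeated text compresses unusually well.
sample_files = {"notes.txt": b"Meeting notes for the project. " * 500}
container = build_container(sample_files)
print(len(container), "bytes in the container")
print(round(expansion_percent(container)), "% larger once extracted")
```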

WHAT IS CULLING?

Culling describes the process of eliminating files from a collection of electronic files. Given that the highest cost element of preparation for disclosure is normally that associated with lawyers reviewing documents, culling techniques are employed to reduce the number of documents to review.

Common techniques to cull documents include deNISTing, filtering, de-duplication, near-de-duplication and email thread analysis. The approach adopted to culling documents may be one of selection of documents to include (otherwise known as an ‘inclusive’ approach) versus selection of files to exclude (known as an ‘exclusive’ approach). It is common to use both inclusive and exclusive approaches.

WHAT IS DE-NISTING?

The process of removing irrelevant system files and other non-user-created files from a collection of electronic data. The US National Institute of Standards and Technology (‘NIST’) regularly publishes an updated list of digital fingerprint values for known system files (the values are in the same MD5 format as used to de-duplicate identical electronic files). The process of filtering out files that appear on the NIST list is often referred to as ‘deNISTing’. This is often the first step in the process of culling data where a broad approach to collection has been adopted, for instance where an entire laptop or PC hard drive has been forensically imaged and therefore contains a large volume of irrelevant system files.
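In outline, deNISTing is a lookup of each file's hash against the known list, as in the Python sketch below. The ‘known’ list here is a one-entry stand-in built for the example, not the real NIST list, and the file names and contents are invented.

```python
import hashlib


def denist(files, known_system_hashes):
    """Return the names of files whose MD5 value is not on the known list.

    files: dict of file name -> bytes.
    known_system_hashes: set of MD5 values (hexadecimal strings).
    """
    return [
        name
        for name, data in files.items()
        if hashlib.md5(data).hexdigest() not in known_system_hashes
    ]


# Stand-in data: one pretend system file and one user-created document.
system_file = b"placeholder operating system library"
collection = {
    "kernel32.dll": system_file,
    "contract_draft.doc": b"Draft contract text",
}
known = {hashlib.md5(system_file).hexdigest()}

print(denist(collection, known))  # ['contract_draft.doc']
```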

WHAT IS EDRM?

The Electronic Discovery Reference Model (“EDRM”) provides a conceptual view of the process of identifying, collecting, reviewing and producing data. The EDRM diagram sets out the various steps and shows how the volume of data is whittled down, and the proportion of relevant data increases, as the process is followed. The elements of the diagram are as follows:

The Boxes

Each “box” represents a major stage and, within each box, there can be a number of detailed sub-processes.

The Coloured Arrows

Although the coloured arrows represent a common general flow of events, you may repeat each process numerous times, frequently returning to earlier stages as your understanding of the matter changes.

The Volume Triangle

The aim is to reduce the volume of information with which you need to work. While you might start with a vast quantity of data, by the end of the process you want to be left with only a small quantity of the most relevant information.

The Relevance Triangle

The increase in relevance goes hand in hand with a decrease in volume and a progressively more organised set of refined data.

WHAT IS AN EMAIL THREAD?

Email chains are created by forwarding and / or replying to an original source email. Email chains are one of the most problematic issues in relation to efficiently managing email centric document reviews owing to the fact that as chains grow it is common for the text of all prior emails to remain in the body of each new email that is created when forwarding or replying. As a result, whilst there is often a high level of duplication of the content within each individual email in a chain there is still a requirement to review each email as it is possible for the author of each new email in the chain to have altered the email body text and / or to have included or excluded attachments at different points in the thread.

Further complicating the review is that email threads often resemble a ‘tree’ structure whereby a single original email may give rise to hundreds or thousands of separate branches each representing a new email chain.

Millnet offer a service whereby the emails at the end of each thread or chain can be identified (and confirmed as containing all the text of the emails earlier in the chain). Depending upon the approach to review it may therefore be possible to review only the end email in the chain. Another approach to using this analysis is that when a reviewer comes across an email that is, say, irrelevant (e.g. a conversation about football results) the reviewer can identify other emails in the same thread at the click of a button and then tag all of them at once, rather than waiting to come across emails in the thread again and again.
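A much simplified way of finding the end emails in a thread is to look for emails whose text is not wholly contained in any other email, as sketched in Python below. This is an illustration of the principle only and is not how Millnet's service works: real thread analysis also has to cope with quoting markers, signatures, edited text and attachments. The function names and sample thread are invented.

```python
def normalise(text):
    """Collapse whitespace and lower-case the text so that layout differences are ignored."""
    return " ".join(text.split()).lower()


def end_of_thread_emails(thread):
    """Return the ids of emails whose text is not contained in a longer email.

    thread: dict of email id -> body text.
    """
    bodies = {email_id: normalise(body) for email_id, body in thread.items()}
    ends = []
    for email_id, body in bodies.items():
        contained_elsewhere = any(
            other_id != email_id and len(other) > len(body) and body in other
            for other_id, other in bodies.items()
        )
        if not contained_elsewhere:
            ends.append(email_id)
    return ends


# Illustrative thread: one original email with two separate replies (two branches).
thread = {
    "e1": "Lunch on Friday?",
    "e2": "Sure, see you there.\n> Lunch on Friday?",
    "e3": "Sorry, I am away.\n> Lunch on Friday?",
}
print(end_of_thread_emails(thread))  # ['e2', 'e3']
```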

WHAT IS A FAMILY?

The most common example of a document ‘family’ is that of emails and their attachments. As a minimum, a document ‘family’ consists of a ‘parent’ (such as an email) to which one or more ‘children’ are ‘attached’. This ‘parent-child’ relationship extends to more than just emails with attachments. For instance, it is increasingly common for the authors of documents to ‘embed’ (insert) files inside other files. For example, a Word document may have inserted within it other Word documents, Excel spreadsheets or even emails which in turn may have other documents attached. Further, it is useful to retain the concept of family relationships between not just emails or other files but also ‘container’ files (refer above) such as ZIP files and even folders.

Just as it is generally more efficient to review attachments to an email at the same time as reviewing the email, it is often helpful to review the entire contents of say a ZIP file or a particular folder at the same time. It is human nature to collect documents in the same electronic ‘folder’ or ‘container’.

It is also important to note that the concept of ‘children’ can extend to ‘grandchildren’, ‘great-grandchildren’ etc to the extent that an attachment to an email has in turn attachments which in turn have attachments (or embedded files). It is not uncommon for a single email to belong to a ‘family’ of documents that could be in the hundreds or even thousands of separate electronic documents. One of the key elements of e-discovery processing is that of ‘extracting’ all children, grandchildren etc whilst retaining the attachment relationship information.

Whilst it is normal practice to disclose certain documents as an entire ‘family’ (especially emails with attachments), this has historically resulted in a need to review all of the documents in the ‘family’ prior to disclosure. The approach to reviewing and disclosing document ‘families’ is a key element of designing the ‘workflow’ for review and disclosure (refer below).

Let’s say a keyword search is undertaken over an email-centric document collection that results in ‘hits’ on 5,000 documents representing a mixture of emails and attached documents. It is often the case that once all ‘family members’ are ‘pulled in’ by virtue of their association with the 5,000 documents that were responsive to keywords, there may be 30,000 or even 50,000 documents that are potentially disclosable. Depending on the nature of the matter, a strategy therefore needs to be considered for going about the review efficiently whilst minimising the risk of, for instance, disclosing privileged family member documents that were not responsive to the original keyword search terms but were included by virtue of association with a document that was responsive.
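The way in which family members are ‘pulled in’ by a keyword hit can be sketched in Python as follows. The parent lookup table, the document names and the function names are assumptions made for this example; a review platform would hold the same relationships in its database.

```python
def root_of(doc_id, parent_of):
    """Follow parent links upwards until the top level document is reached."""
    while parent_of.get(doc_id) is not None:
        doc_id = parent_of[doc_id]
    return doc_id


def expand_to_families(hits, parent_of):
    """Return every document that shares a family with at least one keyword hit.

    parent_of: dict of document id -> parent id (None for a top level document).
    """
    hit_roots = {root_of(doc_id, parent_of) for doc_id in hits}
    return {doc_id for doc_id in parent_of if root_of(doc_id, parent_of) in hit_roots}


# Illustrative collection: email1 has an attachment and a ZIP file with one file inside.
parent_of = {
    "email1": None,
    "attachment1": "email1",
    "zip1": "email1",
    "inside_zip1": "zip1",
    "email2": None,
    "attachment2": "email2",
}
hits = {"inside_zip1"}  # only the 'grandchild' was responsive to the keywords

print(sorted(expand_to_families(hits, parent_of)))
# ['attachment1', 'email1', 'inside_zip1', 'zip1']
```

A single hit on a ‘grandchild’ brings in four documents here, which shows on a small scale how 5,000 hits can grow to tens of thousands of documents for review.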

WHAT IS A LOAD FILE?

A load file contains the elements of data required to add documents into a litigation support review database. Refer to the definition of a database for further explanation of the various elements. If particular elements of the data in the database need to be updated (for instance to add new fields that were not part of the original load file) then this is provided in a load file that is often referred to as an ‘overlay file’ in that it ‘overlays’ or appends data relating to existing documents in the database.

Where the parties to a dispute are both using a litigation review database then electronic disclosure will normally involve providing the data in an agreed load file format. It is possible to manipulate data received in an incompatible format, though best practice is for there to be discussion and agreement as to the format of disclosure in the early stages of a matter so as to avoid unnecessary wasted time and cost.
