A Million Documents At Your Fingertips

2009 August 14
by recapthelaw

In our last post, we mentioned that we were already working with other organizations that support judicial transparency to help us build the public repository that lies at RECAP’s foundation. Public.resource.org, led by Carl Malamud, has been especially helpful in this regard. They have a vast repository of court documents, weighing in at more than 500 gigabytes in total. Over the last few weeks we’ve been pre-stocking the archive with these documents, and we recently crossed the million-document threshold.

What this means is that installing RECAP will not only help you contribute to government transparency, but it’s likely to start saving you money right out of the gate. For example, if you practice law in New York City, you’ll be happy to know that we have 238,098 documents from the Southern District of New York. If you have RECAP installed, you can use PACER the way you normally do, and RECAP will automatically inform you if the document you need is already available for free.

Here is a table of the courts where we have a significant number of documents:

Court                                   No. of Documents
District of Alaska                      52,797
Northern District of California         190,470
District of the District of Columbia    219,049
District of Delaware                    182,900
Central District of Illinois            21,378
District of Massachusetts               217,315
Southern District of New York           238,098
Eastern District of Pennsylvania        20,530

We anticipate importing about a million more documents from Public.resource.org in the coming weeks. The good folks at Justia have also expressed interest in contributing to the repository.

Unfortunately, even after we import all the documents from these organizations, we will still be several million documents short of our goal of a comprehensive, free repository of court records. That’s where you come in. By installing the RECAP public beta, you can make an important contribution to the goal of free public access for all.

10 Responses
  1. Alan Sugarman
    August 14, 2009

    How many of the 238,000 Southern District of New York documents are judicial opinions?

    and how many of those opinions are ocr’d and searchable?

    and what period of time is posted?

    and for that period of time was the collection 100% of all cases available at the time of collection?

    I would guess the answer in part is:
    .05% of the documents are opinions.
    They are not searchable cause the sdny does not ocr most of its opinions.

    Alan

  2. August 15, 2009

    Alan,

    You are right that the majority of the documents are briefs and other materials, simply because that’s the majority of documents on any given docket. I’ll post more about the coverage period as we finish uploading the archive, but SDNY and others cover quite a range.

    If the originals are not OCR’ed (or don’t include a text layer), they will not be internally searchable (eg: via Ctrl+F in Acrobat). SDNY at least claims to be in compliance with the E-Government Act requirements to make opinions searchable (whether or not internal searchability satisfies the requirements of the Act is another matter).

    At some point in the future, we are looking to OCR the full archive ourselves.

  3. donald
    August 16, 2009

    Is there a reason why Google Analytics and Gravatar are included in this site? Would it be possible to switch to http://piwik.org/ so that no data is sent to Google?

  4. August 16, 2009

    Steven

    Sorry for what follows, but as I mentioned to you before, I have had past experience with the implementation of corporate information systems including litigating the results of projects never tested prior to roll-out; quality control, testing, and auditing are part of the normal process of rolling out a product, even a beta.

    So, I decided to audit the claim of “238,098 documents from the Southern District of New York” and used the CM/ECF written opinions reports for the months of December in 2005, 2006, 2007 and 2008. I checked to see whether your RECAP flag was “on” for any of the opinions on the reports. There were no flags shown on the written opinions report, but, to be sure, I then did a test by paying to download a docket report for some of the cases on the written opinions reports. In none of my samples did I find that the opinion was uploaded.

    As a further test, I did upload via RECAP one document, and it did then appear with a RECAP flag in the docket report: gov.uscourts.nysd.267689.10.0.pdf. Since you guys have persisted in refusing to use the lingua franca of the legal world in identifying court documents with the docket number, I then renamed the file on my system as 2005-01-05 SDNY-1:05-cv-04628-DE-10.pdf. With that file name, you will see how easy it is for you or anyone else to now go on NYSD’s CM/ECF and run a docket report and confirm what I have just said (something that cannot be done with the RECAP file name). Also, as you know, it is free for someone to download this opinion document directly from CM/ECF as long as they have a PACER or CM/ECF password. It is free!! Let me repeat. It is FREE NOW. Here is the link directly into CM/ECF: https://ecf.nysd.uscourts.gov/doc1/12703000666.pdf. [As we all know and discussed, the 127 here indicates a case from the SDNY!!]

    So, my original question was – what is the coverage of these 238,000 documents – what is the time period? I need to know this to avoid a legal malpractice claim were I ever to use your resource or one derived from your resource to do anything meaningful.

    As to what these documents are in this assemblage, I do not even think they are mostly briefs – I wish they were. And, if one were researching only case law, and if the opinions buried in RECAP were searchable via Google, how would one separate out the good stuff from the irrelevant.

    But, in any event, you need to inform your affiliates who have engaged in the predictable vaporlaw, a practice you heard me object to in my recent AALL presentation. I refer to the implication at “RECAP, a Firefox plugin that frees US caselaw one page at a time” http://www.boingboing.net/2009/08/15/recap-a-firefox-plug.html. The implication in this article by Malamud and Doctorow is that 20 million pages of case law are in the RECAP documents. This is not helpful nor is it true, and, it seems there is a lot more free case law on PACER CM/ECF (and on websupp.org) for the identified courts than in this RECAP/Malamud project.

    In the meantime, apart from the wholly counter-productive file name and the absence of metadata, I commend you and your colleagues for a brilliant concept and programming. I really like the way you flag the CM/ECF docket sheet with the R symbol.

    However, this is really pre-Beta. I am sure you will get a lot of plaudits from Bloggers who do not test this out, but, you need more than plaudits.

    As a contrast, I believe that websupp.org (which has collected many judicial opinions from CM/ECF) is far more sophisticated from an information content point of view, though your computer programming is far more sophisticated than theirs. But, I would choose their content approach as a contribution to legal research and access to the law – their file name has utility and they pack the metadata from the docket sheet into the Acrobat pdf properties.

    Content rules.

    In computer science, your team get an A; but, in access to the law, well, it is not an A grade at all.

    Also, too bad this is in Firefox. I would guess that 95% of litigators use Outlook. E-mail is how the litigators receive their free look at the CM/ECF documents. They get a link in their Notice of Filing e-mail giving them one free look. Most using Outlook use IE as their browser. [And, it seems maybe that you have disabled RECAP when the free look is exercised for some unstated reason.]

    Anyway, to get this idea to work, you are asking the federal litigators to load up Firefox, perhaps switch their default browser to Firefox, and then save the document with a file name which is completely lacking in meaning to the litigator, adding more work for the litigator. Why would they do this??

    Yours is fantastic technology – let’s make it practical and do practical things like auditing the written opinion reports from the courts, a practical project you started but set aside for this headline grabbing project, which has fewer (if any) short term practical implications for “freeing the law.”

    Of course, the only way you can conduct this written opinion report audit is to use the docket number which appears in Westlaw AND in the CM/ECF report, but you and your colleagues chose to eschew the docket number for computer science expediency. This of course proves the high information content encompassed in a docket number.

    And remember Gresham’s law.

    Alan

    sugarman@sugarlaw.com
    http://www.hyperlaw.com

  5. August 16, 2009

    Hey Donald,

    I disabled Gravatar. I hadn’t realized that it was being triggered on these comments. We’re sticking with Google Analytics for the time being (as noted in our Privacy Policy), but I’ll take a closer look at piwik. Thanks for the suggestions!

    Cheers,
    Steve

  6. August 16, 2009

    Alan,

    Thanks for pushing on these questions. I’d encourage you to add your suggestions to our new feedback system so that others can vote for them or comment on them as well:

    http://recapthelaw.uservoice.com/

    If you’d like to conduct a comprehensive audit, you can simply download the SDNY tarball yourself and look at the dates. All numbers we cite are accurate. The existing “free” opinions on PACER (spotty as they are, as documented by both you and me) have not been our focus… because they are already free.

    In future versions we’ll add support to optionally use docket numbers instead of the internal PACER ids (and perhaps customize the filename in other ways that make more sense to lawyers). As I have mentioned to you in the past, the main file naming scheme cannot use non-unique values like docket number (sadly, PACER is not well enough architected to make sure that they are unique). However, we can give end users the option to tweak the filenames as they wish, as well as very rich metadata so that systems engineers can easily make the translations themselves. We’d love your input on this process.
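
    To make the naming point concrete, here is a minimal Python sketch. It assumes the filename fields are court, internal PACER case id, document number, and attachment number; that layout is only inferred from the example name gov.uscourts.nysd.267689.10.0.pdf quoted earlier in this thread. The names DocKey, build_filename, and parse_filename are hypothetical, and this is not RECAP's actual code.

    ```python
    from collections import namedtuple

    # Hypothetical field layout, inferred from the example filename in this thread.
    DocKey = namedtuple("DocKey", ["court", "case_id", "doc_num", "attachment_num"])


    def build_filename(key):
        """Build a name from identifiers assumed to be unique within a court."""
        return "gov.uscourts.{}.{}.{}.{}.pdf".format(
            key.court, key.case_id, key.doc_num, key.attachment_num
        )


    def parse_filename(name):
        """Recover the identifying fields from a name in the assumed layout."""
        parts = name.split(".")
        if len(parts) != 7 or parts[:2] != ["gov", "uscourts"] or parts[-1] != "pdf":
            raise ValueError("unexpected filename layout: " + name)
        court, case_id, doc_num, attachment_num = parts[2:6]
        return DocKey(court, int(case_id), int(doc_num), int(attachment_num))
    ```

    Because the key is built only from the internal ids, two cases that happened to share a docket number would still get distinct filenames under this sketch. A docket number could then be layered on as an optional, human-readable alias.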

  7. August 17, 2009

    Stephen

    Thanks for your response. I have already added 3 suggestions and will provide more later. I also added a description of RECAP on http://www.hyperlaw.com in which I describe RECAP as brilliant.

    I do think, though, that if you do not know what is in your collection of documents, you should downplay the hype. I am not at all sure that the “For example, if you practice law in New York City, you’ll be happy to know that we have 238,098 documents from the Southern District of New York” – uh, seems like it is not worth much at all based upon my test. If you are to be an information vendor, you should know what you are vending. That is not up to me. The data you have is in reality test data and leave it at that. But, if I did my sample correctly, something seems odd as to your claim. Did I make a mistake in my sample showing no documents in December of 2005, 2006, 2007, and 2008 – was my sample insufficient?? If you cannot answer this, please take out the hype. Yes, I know the newspaper articles will not read as well, but that is not your goal.

    RECAP is great – fantastic – super – without any pre-loaded data and you may have raised expectations. Indeed, the desire to “search” comes from promoting the number of documents. So, you have raised an expectation you do not wish to satisfy now.

    As to the file name, apart from my concerns of having “bad law” circulate with meaningless file names and no embedded metadata (yup, I can put it in, but, most people will not), I would bet that 99% of litigators would be ecstatic to be able to save files with the docket number in the file name. I would be and I can name 6 others. This is why the websupp.org methodology is superior to what you do (they could be mixed).

    I have this suggestion – append the docket number to the end of your file name. In any parsing you are doing, you can just match within a string, and not match the entire string, or strip out the last characters, or use any of many other methods. I think you will find that attorneys will be much more enthusiastic about using your system – and may even induce them to load Firefox – and any versions making their way onto the Internet will have this key piece of information – at the minimum.
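
    The suffix idea could look roughly like the following Python sketch. The "--" separator, the function names, and the replacement of ":" with "-" (a colon is not allowed in Windows file names) are all assumptions made for illustration, as is the base layout inferred from gov.uscourts.nysd.267689.10.0.pdf. None of this is part of RECAP.

    ```python
    import re

    # Assumed base layout: gov.uscourts.<court>.<case id>.<doc>.<attachment>
    BASE_RE = re.compile(r"^(gov\.uscourts\.[a-z0-9]+\.\d+\.\d+\.\d+)")


    def append_docket_number(filename, docket_number):
        """Return the filename with a file-system-safe docket number appended."""
        stem, ext = filename.rsplit(".", 1)
        safe = docket_number.replace(":", "-")
        return "{}--{}.{}".format(stem, safe, ext)


    def base_name(filename):
        """Match only the leading identifier, ignoring any appended suffix."""
        match = BASE_RE.match(filename)
        if match is None:
            raise ValueError("not a recognised name: " + filename)
        return match.group(1) + ".pdf"
    ```

    Under these assumptions, code that matches only the leading identifier keeps working whether or not the suffix is present, while a person browsing the files still sees the docket number.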

    As to metadata, I loaded a suggestion because you left out a lot of important case metadata. And, the document description is only in the docket xml – I would think one would want this in the document xml as well.

    Alan

    • August 17, 2009

      @alan

      if I did my sample correctly, something seems odd as to your claim. Did I make a mistake in my sample showing no documents in December of 2005, 2006, 2007, and 2008 – was my sample insufficient?? If you cannot answer this, please take out the hype. Yes, I know the newspaper articles will not read as well, but that is not your goal.

      Your mistake was in assuming that the WrtOpRpt.pl was being parsed in order to provide Recap links. It is not. This is on our todo list but, as I mentioned, providing a free RECAP link when the original document is already free is not a very high priority.

      As to the file name

      As I mentioned, we have code which is close to being ready which lets you assign other elements to the filename. We had not released it yet because we were testing it more with real lawyers. If you would like to try a pre-release of this functionality and give feedback, let me know.

      append the docket number to the end of your file name

      Let me think this one through. There is some appeal to this approach, but I also envision some disadvantages. I’m happy to discuss this over email if that’s a better venue.

      As to metadata, I loaded a suggestion because you left out a lot of important case metadata.

      Yes, there is some metadata we should add. I’ll comment in more detail over at your suggestion, but I would just clarify that you want to be looking at files like this and not this (the latter are not intended to hold complete metadata… or really to be used much at all).

      One thing we do need to do is better document our metadata structure.

  8. August 17, 2009

    steve PERMALINK
    @alan

    STEVE SAID:
    Your mistake was in assuming that the WrtOpRpt.pl was being parsed in order to provide Recap links. It is not. This is on our todo list but, as I mentioned, providing a free RECAP link when the original document is already free is not a very high priority.

    ALAN RESPONDS
    I did not make that assumption. I knew it was not being parsed – though, it does seem to attempt to upload the file. What I did in order not to be wasting money uploading stuff was to use the written opinions reports to identify documents that were filed in the SDNY on CM/ECF in the months of December for 2005, 2006, 2007 and 2008 – assuming that some of your 238,000 SDNY documents would have picked these up. Any documents on this report that I uploaded from the docket sheet would be free. So, I identified the document and then ran a docket report for the time period so I would only have to pay for 2 or 3 pages of the docket report. I also checked to see if the file was already one of the 238,000 files you hyped. None were. To verify the functioning of RECAP, I then uploaded a free opinion file from the docket sheet (that is why you saw it in my example). But, in my test, when I looked at the docket report for a case, I did not find either the opinion or any other documents uploaded. Now, I may have made a mistake and invite you to replicate my test.

    As I said, this is your responsibility to tell me what is in your data, not the other way around!!!

    STEVE SAID
    As to the file name
    As I mentioned, we have code which is close to being ready which lets you assign other elements to the filename. We had not released it yet because we were testing it more with real lawyers. If you would like to try a pre-release of this functionality and give feedback, let me know.

    ALAN RESPONDS
    Well, I am a real lawyer who practices in federal court and would be willing to check the functionality. The other real lawyers who are federal litigators with whom I spoke had little interest in loading Firefox, and no interest when I told them the docket number would not be in the file name.

    So, I will test it.
    But, before you release it, I would suggest you do a posting and list a variety of file name formats and get some feedback as to the ways in which the name would display. Part of the problem of course is the blank title field in the acrobat file which is going to result in some unfortunate search results if the documents are posted and available for search by Google. Does adding metadata mess up the hash – guess it does.

    Also, the other real lawyers I spoke to had even less interest when I told them that RECAP would not save the free look from the e-mailed Notice of Filing since they would be signing in with their ECF password. Am I correct in this?

    STEVE SAID RE (append the docket number to the end of your file name)
    Let me think this one through. There is some appeal to this approach, but I also envision some disadvantages. I’m happy to discuss this over email if that’s a better venue.
    ALAN RESPONDS:
    Duh – you guys are too good at programming to be unable to find a solution. These files are going to end up unchanged on the Internet – no doubt about it. So, let’s make them useful for searching.

    Sure – e-mail me to discuss.

    STEVE SAID:
    Yes, there is some metadata we should add. I’ll comment in more detail over at your suggestion, but I would just clarify that you want to be looking at files like this and not this (the latter are not intended to hold complete metadata… or really to be used much at all).

    ALAN SAID:
    I did see all of the metadata. When I prototyped this two years ago, I captured info such as the case type. It was useful to search on “nature of suit” such as “insurance”. Even the much maligned CM/ECF system allows this type of search. You should capture case type, nature of suit, cause, jurisdiction, case flag.

    As to the information on your docket report as to each document (the rows), it seems to me that logically, those should be in the xml file for the document as well. That is document related information. You did not do this because when you have a docket sheet, you do not necessarily have each document on the docket sheet.

    STEVE SAYS
    One thing we do need to do is better document our metadata structure.

    ALAN RESPONDS
    Well yes. It would be interesting to see a comparison between your xml files and those used by alt.law and by the Federal Reporter site. A careful syntactical and logical analysis would be useful. I do recommend thinking in terms of type of court, political jurisdiction, name of court, level of court etc. You really need a structure that would work, say, for an administrative agency of the City of New York. US-NY-NYC-AGENCY-BSA identifies an agency known as the New York City Board of Standards and Appeals. What would an XML file for the FCC look like? There needs to be a logical structure and data purity in the same way one wants data purity in a database structure. And I still do think using gov.uscourts is misleading.

    Alan
    ps Anyone reading this should know, as Steve knows, that my comments are not personal and many of these thoughts and criticisms have been applied to others in the past.

    • August 17, 2009

      I think we’ve taken this as far as possible in the comments. Let’s continue offline.
