USPTO and Reed Tech Public Dissemination of Data Contract Update
After seven years of service, the Public Dissemination of Data (PDD) contract between the United States Patent and Trademark Office (USPTO) and Reed Tech will officially end on June 25, 2020.
Sources of Patent Data
The USPTO bulk data-sets for grants and applications come in several versions: PDF files only, full-text (with and without TIFF images/drawings), and bibliographic front-page data only. The USPTO Gazette bulk files contain the notices published in each issue, which provide important information and rule changes concerning both patents and trademarks. The USPTO Cancer Moonshot data-set consists of 269,353 selected patent documents and is intended to reveal new insights into investments in cancer therapy research and treatments and to increase the pace of cancer research. A USPTO bulk-data parser that downloads the data files and parses them into a normalized MySQL or PostgreSQL database is available here: https://github.com/rippledj/uspto
- Official Gazette for Patents (2002 – present) (Description)
- Patent Grants bulk data (1976 – present) (Visit)
- Patent Application bulk data (2001 – present) (Visit)
- Patent Assignment data (2014 – present) (Visit)
- Patent Assignment Economics Data (Visit)
- Office Action bulk data (2008 – 2017) (Visit)
- Patent Maintenance Fees (2008 – 2017) (Visit)
- PAIR bulk data (2014 – 2017) (Visit)
- Cancer Moonshot data (1976 – 2016) (Visit) (Data Description) (Project Description)
- Historical patent data (1840 – 2014) (Visit)
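If you would rather read a full-text bulk file yourself than use the parser, one thing to watch for is that a single file typically holds many XML documents concatenated back to back, which a standard XML parser rejects. Below is a minimal sketch of splitting such a file first, assuming every record starts with its own `<?xml` declaration. The element names in the sample (`doc`, `doc-number`) are hypothetical placeholders, not the USPTO schema.

```python
import xml.etree.ElementTree as ET


def split_concatenated_xml(text):
    """Yield each XML document from a string that holds several back to back."""
    marker = "<?xml"
    start = text.find(marker)
    while start != -1:
        # The next declaration (if any) marks the end of the current document.
        end = text.find(marker, start + len(marker))
        chunk = text[start:] if end == -1 else text[start:end]
        yield chunk.strip()
        start = end


# Hypothetical two-record sample; real files use the USPTO element names.
sample = (
    '<?xml version="1.0"?>\n<doc><doc-number>1</doc-number></doc>\n'
    '<?xml version="1.0"?>\n<doc><doc-number>2</doc-number></doc>\n'
)

numbers = [
    # Parse from bytes so an encoding declaration in real data is honoured.
    ET.fromstring(chunk.encode("utf-8")).findtext("doc-number")
    for chunk in split_concatenated_xml(sample)
]
print(numbers)
```

For the real multi-hundred-megabyte files you would probably stream the file line by line instead of loading it into one string, but the splitting idea is the same.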
The USPTO API Catalog includes several APIs whose formats vary slightly but are all based on Apache Solr API syntax (Documentation). Some of the APIs are well documented with examples; others are not.
- IP Marketplace Platform API (Visit) (Data Description) (Web-portal)
- USPTO Office Action Text Retrieval API (Visit) (Data Description)
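Because the APIs follow Solr conventions, a request is essentially a URL with an encoded query string. The sketch below builds one using the standard Solr parameter names `q`, `start` and `rows`. The base URL and the `title` field are hypothetical, and an individual USPTO API may name its parameters differently, so check its documentation before relying on this.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, start=0, rows=20):
    """Return a GET URL for a Solr-style search endpoint."""
    params = {"q": query, "start": start, "rows": rows}
    # urlencode escapes quotes, colons and spaces in the Lucene-style query.
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and field name, for illustration only.
url = build_solr_query_url(
    "https://api.example.org/search",
    'title:"solar cell"',
    rows=5,
)
print(url)
```

Paging through a large result set is then a matter of calling the function again with a larger `start` value.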
WARNING: Read the Google BigQuery pricing documentation before using it. You can accumulate costs very quickly.
Google BigQuery is a paid option for accessing highly accurate, up-to-date USPTO patent data. You will need to create a Google billing account, create a project, and add BigQuery permission to your project to get full access. Google offers a free trial with $300 of credit and 10 GB of free storage, and BigQuery is included in the Google Cloud Platform (GCP) free tier, so as long as you stay under the limits, you can use BigQuery for free. Be warned, however, that a single query on the USPTO public patent dataset will consume ~1.5 tebibytes of data, which is more than the 1 tebibyte limit of the free tier.
That means that each query to the database will cost you around $50.
Once you understand the GCP billing options, you can select a package that suits your needs and optimize your queries to achieve a cost-effective profile. Be aware, though, that BigQuery is not billed the same way as other Google Cloud services, and the USPTO public patent dataset is a large database (~2 TB), which factors into the costs. Also, you can test your queries using the web console but cannot download data. Finally, Google BigQuery offers a great API which can be used to build functionality into your software, and there is an unofficial GitHub repository with some examples of useful applications.
Here is a shortlist of patent datasets available via Google BigQuery:
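To reason about cost before running anything, it helps to turn bytes scanned into a billable amount. The sketch below does only that arithmetic. The function name is made up for illustration, and the rate passed in is a placeholder rather than Google's actual price, so substitute the figure from the current pricing page.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def estimate_query_cost(bytes_scanned, price_per_tib_usd, free_bytes_remaining=0):
    """Estimate the on-demand charge for one query after the free allowance."""
    billable_bytes = max(0, bytes_scanned - free_bytes_remaining)
    return billable_bytes / TIB * price_per_tib_usd


# A 1.5 TiB scan with the full 1 TiB free allowance still unused.
scan = int(1.5 * TIB)
cost = estimate_query_cost(
    scan,
    price_per_tib_usd=10.0,  # placeholder rate, NOT the real BigQuery price
    free_bytes_remaining=1 * TIB,
)
print(f"billable TiB: {(scan - TIB) / TIB}, estimated cost: ${cost:.2f}")
```

The BigQuery client libraries can also report how many bytes a query would scan through a dry run, which gives you the `bytes_scanned` input without incurring a charge.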
- Google Patents Public Data (View) contains worldwide bibliographic and US full-text dataset of patent publications.
- Patent Examination Data System (PEDS) (View) contains data from the USPTO examination process: the bibliographic, published document and patent term extension data tabs of Public PAIR for patent applications from 1981 to present.
- The Office Action Research Dataset (View) contains detailed information derived from the Office actions issued by patent examiners to applicants during the patent examination process. It covers 4.4 million Office actions mailed from 2008 to mid-2017 by USPTO examiners to the applicants of 2.2 million unique patent applications.
- Patents View Data (View) longitudinally links inventors, their organizations, locations, and overall patenting activity.
- The World Bank World Development Indicators Data (View) contains the primary World Bank collection of development indicators, compiled from officially-recognized international sources.
- SureChEMBL Data (View) contains compounds extracted from the full text, images and attachments of patent documents.
The World Intellectual Property Organization (WIPO) has many IP database resources listed on its website, and all of them are free. The list below is not an exhaustive list of WIPO resources, but it includes the most fundamental and most interesting items.
- PATENTSCOPE (View) (Data Coverage) (Sample) lets you search 90 million patent documents, including 3.9 million published international patent applications (PCT). Documents are available as PDF, XML, TIFF, and HTML and include bibliographic (front-page) data, description, claims, notices, and application status. Machine translation is available between many languages.
- International Patent Classification (IPC) Web-portal (View)
- Locarno Classification (View) international classification used for the purposes of the registration of industrial designs.
- IP laws and treaties (View) contains full texts, summaries and membership of the international IP treaties administered by WIPO.
- Global Brand Database (View) (Data Description) contains more than 42,240,000 records from some 55 national and international collections of trademarks and brands.
For the most part, Patents View datasets are simply official USPTO bulk-datasets converted into TSV and organized for your convenience. A document with descriptions of all the available datasets is available here. However, there are several datasets that have been compiled by the team at Patents View that are unique and potentially very useful to your project.
GitHub repositories with Python and R scripts to assist with the import and analysis of the data are available, making it very easy to build a local database from the TSV files that Patents View provides. A prototype API, a web-console search UI, and many interesting data visualizations are also available.
Finally, possibly the most interesting contribution of Patents View to patent bulk-data is its process of Assignee Disambiguation:
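If you only need one or two of the TSV files, you can load them without any extra tooling. Here is a minimal sketch using only the Python standard library and SQLite. The table name, the column names and the two sample rows are hypothetical stand-ins for a real file.

```python
import csv
import io
import sqlite3


def load_tsv(conn, table, tsv_file):
    """Create a table from the TSV header row and insert every data row."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Table and column names are interpolated, so use trusted names only.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# In-memory sample standing in for a downloaded TSV file.
sample = io.StringIO(
    "id\torganization\n"
    "1\tGoogle Inc.\n"
    "2\tGOOGLE TECHNOLOGY HOLDINGS LLC\n"
)
conn = sqlite3.connect(":memory:")
load_tsv(conn, "assignee", sample)

# LIKE is case-insensitive for ASCII text in SQLite by default.
rows = conn.execute(
    "SELECT organization FROM assignee WHERE organization LIKE ?",
    ("%google%",),
).fetchall()
print(rows)
```

For a real file, replace the `io.StringIO` sample with `open(path, newline="", encoding="utf-8")` and connect to a database file instead of `:memory:`.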
The PatentsView data generation process does not fully disambiguate the names of assignees. The University of Michigan’s STATA Utilities(1) are initially applied to raw assignee names to correct minor typos and misspellings. The Jaro-Winkler(2) string similarity algorithm is then applied to each pair of processed assignee names to disambiguate records. In other words, processed assignee names that are within a certain bound of similarity are considered the same and are linked together.
Although the process of disambiguating and consolidating assignee names is not perfect, it makes the task much simpler. In the assignee file, the following records are returned when text-searching for “Google”:
- Google Technoogy Holdings LLC
- GOOGLE TECHNOLOGY HOLDINGS LLC
- Google Inc.
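To see how the Jaro-Winkler step behaves on names like these, here is a small pure-Python sketch of the textbook algorithm. It is not the Patents View implementation: the lowercasing and the 0.9 threshold are assumptions made for illustration, since the quoted description does not state the similarity bound actually used.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Half the number of matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


THRESHOLD = 0.9  # hypothetical bound, chosen only for this example

names = [
    "Google Technoogy Holdings LLC",
    "GOOGLE TECHNOLOGY HOLDINGS LLC",
    "Google Inc.",
]
normalized = [name.lower() for name in names]
for other in normalized[1:]:
    score = jaro_winkler(normalized[0], other)
    print(f"{other!r}: {score:.3f} linked={score >= THRESHOLD}")
```

With this threshold the two "Holdings" spellings are linked while "Google Inc." is not, which matches the kind of residue you see in the assignee file: near-identical spellings collapse, but genuinely different legal names for the same company remain separate.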
Below is a shortlist of interesting data offered by Patents View, while the full list can be found here:
- Gender assignment of disambiguated inventor (TSV) (Data Description)
- Lookup table of current CPC groups (TSV)
- Lookup table of WIPO technology fields (TSV)
- EP bibliographic data (EBD)
- European Patent Register data
- EP full-text data
- EP full-text data for text analytics
- EPO worldwide bibliographic data (DOCDB)
- EPO worldwide legal event data (INPADOC)
- Sequence listings
- National full-text data
- Decisions of the EPO boards of appeal
However, much more data is available from the EPO:
- PATSTAT (Visit) a robust statistical analytics engine which allows you to retrieve data using SQL, and customize, display, visualize and download results. You can try the service with a one-month free trial subscription.
- European Publication Server web service (Visit) enables access to weekly updated data (publication dates, weekly patent lists, raw data) and to XML, HTML, TIFF and PDF/A versions of European (EPO) A (applications) and B (grants) publications. The documentation provided describes the JSON RESTful API syntax and how to register a user-agent with the EPO data-service. Once you have registered a user-agent, you can simply query the API free of charge.
- European Patent Office OPS (Visit) also offers a RESTful API for its patent data, designed to let clients use global patent data in their own products and applications (the European Publication Server web service, by contrast, covers strictly EPO patent data). Basic access to the API is free. However, annual subscription fees are required if you need more than 4 GB of data per week. Authentication is handled using OAuth, which is also used to track your data usage, and documentation is available at the EPO Web-services page under the Further Information tab.
- Espacenet patent search (Visit) (Documentation) provides a web-portal that contains worldwide coverage and search features, with free access to information about inventions and technical developments from 1782 to present. The largest public patent database on the internet, Espacenet contains 100 million patent documents from more than 90 countries including Canada and the U.S., plus 2 million records for non-patent literature cited in EPO search reports.
- Online training (Visit) provides access to downloadable learning materials, videos, quizzes, live webinars, forums and much more. Many of them are directly accessible, i.e. you don’t need to be registered in the platform.
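For the OPS service listed above, the OAuth step usually amounts to exchanging your consumer key and secret for a short-lived access token. The sketch below builds such a client-credentials request without sending it. The token URL is an assumption and should be verified against the OPS documentation, and the key and secret are placeholders.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

# Assumed token endpoint; confirm it in the OPS documentation before use.
TOKEN_URL = "https://ops.epo.org/3.2/auth/accesstoken"


def build_token_request(consumer_key, consumer_secret, token_url=TOKEN_URL):
    """Build (but do not send) an OAuth client-credentials token request."""
    credentials = f"{consumer_key}:{consumer_secret}".encode("utf-8")
    headers = {
        # HTTP Basic authentication carries the key and secret.
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return Request(token_url, data=body, headers=headers, method="POST")


# Placeholder credentials; use the ones issued when you register your app.
request = build_token_request("my-key", "my-secret")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` should return a JSON body containing the token, which you would then send as a `Bearer` value in the `Authorization` header of later API calls.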
Google Patents – web-formatted full-text data retrievable by document number (abstract, description, claims, references, etc.; essentially the USPTO bulk XML data).
Derwent LitAlert contains records for patent and trademark infringement lawsuits filed in the ninety-four U.S. District Courts and reported to the Commissioner of the United States Patent and Trademark Office (USPTO). The data is provided by Thomson Reuters Westlaw directly to Clarivate Analytics, which produces the LitAlert database.
If you have an academic licence through your university, you may access the LitAlert database for free through your university’s portal. This may be through Thomson Reuters’ Web of Knowledge. However, if you do not have a special academic licence, you should contact Clarivate Analytics for information about licensing fees and online access to the data. The FAQs page at Clarivate Analytics provides a good description of the dataset. If you want more information about a case, you can use a case law system such as Westlaw or Lexis to download the full document.
Search 2 million Canadian patents from 1869 to the present. Full-text documents are available from 1869 forward. This database is produced by the Canadian Intellectual Property Office (CIPO) and is updated weekly. Access to downloadable bulk datasets is free. The bulk-data products and web-search portals are listed below:
- Patent data: Bibliographic and full text (XML) (Visit) (Data Description) (File Index) contains weekly update zip files for the past two years and historical bulk data files covering 2 months per file.
- Basic keyword web-search portal (Visit)
- Advanced keyword search (Visit)
- Patent number search (Visit)
DEPATISnet (DPMA): https://depatisnet.dpma.de/DepatisNet/depatisnet?window=1&space=menu&content=index&action=index
DEPATISnet is produced by the German Patent and Trademark Office (DPMA). It covers patent documents from more than 90 countries including Canada and the U.S.
JP-PlatPat (JPO): https://www.j-platpat.inpit.go.jp/
The official databases of the Japan Patent Office (JPO). Search Japanese patents, utility models and designs from 1922 to the present.
U.S. Patent Assignment Database (USPTO): http://assignment.uspto.gov/
This database contains recorded U.S. patent assignment (ownership) changes from August 1980 to the present. It is updated daily.
ViDoc (IMPI): http://vidoc.impi.gob.mx/
The official patent database of the Mexican Institute of Industrial Property (IMPI). In addition to applications and granted patents, ViDoc includes utility models and designs.
Indian Patent Office
European Patent Register