The VERIS Community Database (VCDB)

Information sharing is a complex and challenging undertaking. If done correctly, everyone involved benefits from the collective intelligence. If done poorly, it may mislead participants or create a learning opportunity for our adversaries. The Verizon RISK Team supports and participates in a variety of information sharing initiatives and research efforts. We continue to publish the Verizon Data Breach Investigations Report (DBIR) annually, now with an unprecedented number of new data-sharing partners, and we are committed to keeping the report publicly available and free to download. We regularly receive inquiries about our dataset and whether we can share more of it, but agreements with our partners and customers limit what we can share in raw form.

Go straight to the data on GitHub

The Problem

While there are a handful of efforts to capture publicly disclosed security incidents, there is no unrestricted, comprehensive raw dataset of security incidents available for download that is rich enough to support both community research and corporate decision-making. Some organizations collect and, in some form, disseminate aggregated collections, but either those collections are not in a format that lends itself to the data manipulation and transformation that research requires, or the underlying data are not freely and publicly available for use. This gap has long hampered researchers who study the problems surrounding security incidents, as well as risk managers who are starved for reliable data on which to base their risk calculations.

Our Contributions to the Solution

To address this problem that has plagued the community, we are pleased to announce the VERIS Community Database (VCDB), which aims to collect and disseminate information on all publicly disclosed data breaches. The data are coded into VERIS format and are available for public use. The initial release had just over 1,200 incidents, primarily from 2012 and 2013. Since then, the dataset has grown to well over 8,000 individual incidents. Data sources include incidents reported to the Department of Health and Human Services (HHS), the sites of the various Attorneys General that provide breach notification source documents, media reports, and press releases. We intend to continue to augment this dataset to capture as many incidents as possible so that others can benefit. Given the initial makeup of the data, care should be taken when basing decisions on it until it has become more comprehensive and representative. The data are currently biased towards the Health sector, since HIPAA requirements mandate public disclosure of breach incidents. Since the start of the project, the regulatory landscape has changed significantly. All 50 U.S. states now have some form of breach notification law on the books, and privacy regulations around the world have become more robust as well.

The data are available in the GitHub repository. There you will find an individual JSON file for each incident in the dataset, including the original URLs we used when coding the case. The Issues list is how we track breach reports that have not yet been coded into JSON.
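Because each incident is stored as its own JSON file, a local copy of the repository can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `load_incidents` is made up for illustration, and the directory path in the usage comment is an assumption about where the JSON files sit in your clone.

```python
import json
from pathlib import Path


def load_incidents(directory):
    """Read every *.json file in `directory` and return a list of dicts.

    `directory` is wherever the incident files live in your local copy;
    the exact location inside the repository is not assumed here.
    """
    incidents = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            incidents.append(json.load(handle))
    return incidents


# Hypothetical usage; adjust the path to match your clone:
# incidents = load_incidents("VCDB/data/json/validated")
# print(len(incidents), "incidents loaded")
```

Loading everything into a list of dictionaries keeps the example dependency-free; for larger analyses you may prefer to flatten the records into a table first.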

What can you do with this data?

You can “ask” it questions: think of something you’d like to know and start looking into the data to answer it. Prove or disprove an assumption you have made in your own work. You can make direct comparisons between the findings in the DBIR and the public data to see how they differ. You can filter by industry and organization size to see how your organization stacks up against companies of the same size and industry. If you use VERIS in your workplace, you can make comparisons against your own data as well (which is a good reason to look at adopting VERIS). Eventually, this will become a rich, freely available data source for conducting this type of ad hoc research.
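As an illustration of filtering by industry and organization size, here is a small sketch that works on incidents already loaded as Python dictionaries. It assumes each record has a `victim` object with an `industry` field holding a NAICS code and an `employee_count` field holding a size band, plus an `action` object keyed by action category; check these field names against the VERIS schema version you are using. The helper names and the three sample records are invented for the example and are not real VCDB entries.

```python
from collections import Counter


def filter_incidents(incidents, industry_prefix=None, employee_count=None):
    """Keep incidents whose victim matches an industry prefix and size band."""
    matches = []
    for incident in incidents:
        victim = incident.get("victim", {})
        if not isinstance(victim, dict):
            continue  # skip records whose victim field has an unexpected shape
        industry = str(victim.get("industry", ""))
        if industry_prefix and not industry.startswith(industry_prefix):
            continue
        if employee_count and victim.get("employee_count") != employee_count:
            continue
        matches.append(incident)
    return matches


def count_actions(incidents):
    """Count how often each top-level action category appears."""
    counts = Counter()
    for incident in incidents:
        counts.update(incident.get("action", {}).keys())
    return counts


# Made-up sample records, shaped like the assumed fields above.
sample = [
    {"victim": {"industry": "622110", "employee_count": "1001 to 10000"},
     "action": {"error": {}}},
    {"victim": {"industry": "522110", "employee_count": "1001 to 10000"},
     "action": {"hacking": {}, "malware": {}}},
    {"victim": {"industry": "621111", "employee_count": "11 to 100"},
     "action": {"physical": {}}},
]

# NAICS codes beginning with "62" fall under Health Care and Social Assistance.
healthcare = filter_incidents(sample, industry_prefix="62")
print(count_actions(healthcare))
```

Matching on a code prefix rather than the full code lets you compare at the sector level first and narrow down only when enough incidents remain to say something meaningful.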