Establishing relationships between entities using unstructured data presents a unique challenge. How do you determine if a person’s name that appears in two separate sources is the same person? How do you figure out if, and how, that person is related to other people or companies in those or related documents? Are there different types of relationships? Can you strengthen or weaken those connections by introducing more data?
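
As a starting point for the "same person?" question, here is a minimal sketch of fuzzy name comparison using only the Python standard library. The helper names (normalize_name, name_similarity) and the suffix list are illustrative assumptions, not part of either data set or of any suggested tool. The score is only one signal: two different people can share a name, so treat it as evidence to combine with other data, not as an answer.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical list of suffixes to ignore when comparing names.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "esq"}


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop suffixes, sort tokens."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z]+", plain.lower())
    tokens = [t for t in tokens if t not in SUFFIXES]
    # Sorting makes "Smith, John" and "John Smith" compare equal,
    # at the cost of discarding word order.
    return " ".join(sorted(tokens))


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two raw name strings."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
```

For example, name_similarity("Smith, John A.", "John A. Smith") treats the two spellings as identical after normalization, while unrelated names score noticeably lower. Any threshold for calling two names a match is a tuning choice you would need to validate against the data.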

Your Challenge

Identify, strengthen, and classify the relationships between entities found in the data sets listed below. Use additional data sources to uncover connections between entities that have no apparent connection within those data sets alone.

Data Sets

Come to our booth to download copies of our two data sets, or download them from the following sources:

1. Public court cases:

This data set can be downloaded from the CourtListener website, which is maintained by the non-profit Free Law Project. The data set contains court opinions, judges, dockets, courts, plaintiffs, and defendants in JSON format.


2. IAPD data set:

Published by the SEC, this data set lists investment advisors, their companies, and public disclosures made about them.

The court data set can be viewed as a social graph in which the interactions take place in legal settings among courts, judges, plaintiffs, and defendants, whether those parties are people or companies. What interesting information can you find in this data set? Can you find non-obvious relationships between people, between people and companies, or between companies?
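
One way to make the social-graph view concrete is a co-occurrence graph: two entities are linked when they appear in the same case, and the link gets stronger with every additional case they share. The sketch below assumes each case has already been reduced to a dictionary with an "id" and a list of "parties"; those field names and the sample records are hypothetical and do not reflect the actual CourtListener JSON layout.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(cases):
    """Map each unordered entity pair to the set of case ids they share."""
    edges = defaultdict(set)
    for case in cases:
        # Sort so that each pair is always stored in the same order.
        parties = sorted(set(case["parties"]))
        for a, b in combinations(parties, 2):
            edges[(a, b)].add(case["id"])
    return edges


def neighbors(edges, entity):
    """Return {other_entity: number_of_shared_cases} for one entity."""
    counts = {}
    for (a, b), case_ids in edges.items():
        if a == entity:
            counts[b] = len(case_ids)
        elif b == entity:
            counts[a] = len(case_ids)
    return counts


# Made-up sample records, for illustration only.
cases = [
    {"id": "case-1", "parties": ["Judge A", "Acme Corp", "Jane Doe"]},
    {"id": "case-2", "parties": ["Judge A", "Acme Corp", "John Roe"]},
    {"id": "case-3", "parties": ["Judge B", "Jane Doe", "John Roe"]},
]
graph = build_cooccurrence_graph(cases)
```

Calling neighbors(graph, "Acme Corp") lists every entity that shares a case with Acme Corp along with the number of shared cases, which is a simple proxy for connection strength. Classifying the type of relationship (judge, opposing party, co-defendant) would need role information that this sketch does not model.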

The Investment Adviser Public Disclosure (IAPD) data set, made publicly available by the SEC, is more structured. It contains the names of advisors and their companies, along with some public disclosures made about them. Can you find related and interesting information across both data sets? Can you use additional data sets to build connections that aren’t apparent between the two data sets?


Suggested open source tools

1. Apache Stanbol (we highly recommend starting with this library unless you have previous experience with the other two)

2. Stanford Entity Resolution Framework

3. OpenNLP