Collating Network Data
Problem being addressed
Network data is increasingly prevalent. The difficulty is finding the "true" underlying network structure when multiple network data sources are present. What techniques would help?
The authors introduce a framework that is applied iteratively to find the true network structure in the data. Given multiple input network data structures that say the same thing in different ways (e.g. one entity appearing under several labels, such as John Smith and J. Smith), the goal of the framework is to produce one output network data structure. Three specific problems are tackled: (1) entity resolution, (2) link prediction and (3) node labelling. Entity resolution aims to resolve nodes correctly when one entity appears under multiple labels. Link prediction aims to infer the relationships between the nodes correctly. Node labelling determines the role of each node in the network, e.g. CEO or Manager. To solve these problems, the framework uses a weighting algorithm that infers the best output graph subject to a list of constraints.
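To make the entity resolution and collation steps concrete, here is a minimal Python sketch. It is not the authors' algorithm: it swaps in a crude name-matching key (first initial plus surname) for entity resolution and a simple sum of per-source weights for edge scoring, with no constraints. The function names, the source weights and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def name_key(label):
    """Crude matching key (an assumption): first initial plus surname, lowercased."""
    parts = label.replace(".", " ").lower().split()
    return (parts[0][0], parts[-1])


def resolve_entities(labels):
    """Map every label to one canonical label per matching key."""
    groups = defaultdict(list)
    for label in labels:
        groups[name_key(label)].append(label)
    canonical = {}
    for members in groups.values():
        # Use the longest label as the canonical form (ties broken alphabetically).
        representative = max(sorted(members), key=len)
        for member in members:
            canonical[member] = representative
    return canonical


def collate(sources):
    """Merge weighted edge lists into one graph over resolved entities.

    sources: list of (source_weight, [(label_a, label_b), ...]).
    Returns {frozenset({a, b}): summed weight}.
    """
    labels = sorted({label for _, edges in sources for edge in edges for label in edge})
    canonical = resolve_entities(labels)
    merged = defaultdict(float)
    for weight, edges in sources:
        for a, b in edges:
            ca, cb = canonical[a], canonical[b]
            if ca != cb:  # drop self-loops created by merging labels
                merged[frozenset((ca, cb))] += weight
    return dict(merged)


# Hypothetical inputs: two sources describing the same people in different ways.
email_edges = [("John Smith", "Ann Lee"), ("J. Smith", "Bob Ray")]
org_chart_edges = [("J. Smith", "A. Lee")]
graph = collate([(1.0, email_edges), (0.5, org_chart_edges)])

# Keep only edges supported strongly enough (threshold is an arbitrary choice).
strong_edges = {edge for edge, weight in graph.items() if weight >= 1.5}
```

Here the link between John Smith and Ann Lee is supported by both sources, so it survives the threshold, while the link to Bob Ray appears in only one source and does not. A real implementation would replace the name key with a learned or constraint-based matcher and would also assign node roles, which this sketch leaves out.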
Advantages of this solution
This is an example of a data preprocessing algorithm that can help resolve multiple data sources into a single network data set. Run as part of a data clean-up step, it gives a data scientist a better chance of getting accurate results further down the line.
Solution originally applied in these industries
Possible New Applications of the Work
The entertainment industry uses a number of different graph data structures. It could be useful to build a "master network" by applying this kind of resolution, and then use it to predict, for example, movie success based on the actors, directors and producers involved and their network embedding.
Ecologists often talk of the web dynamics that govern complex systems like forests. It could be useful to build network structures for several different ecosystems and then attempt to resolve them into a single network based on ecological roles or ecosystem services. For example, brown bears and wolves may provide similar ecosystem services in different areas.
Banking networks, particularly debt resale networks, are hard to resolve in practice, mainly because of the opacity of credit swap contracts. However, to understand systemic risk in banking, it would be worth trying to resolve these networks into a single network by collating the (sparse) available data.
Source DOI: #############
Source URL: #############