Data Commons (The open knowledge Repository)

Dr. Ramanathan Guha

Clive Humby, a famous mathematician and data science enthusiast, once said that “Data is the new oil”. Indeed, data is a valuable and largely untapped resource. However, most data currently exists in raw form and becomes valuable only once it has been cleaned and can be used to generate insights. To explain how Data Commons is working towards this aim, Dr. Ramanathan Guha, a Google Fellow and Vice President at Google who leads the Data Commons project, was invited to deliver the third RBCDSAI Latent View Colloquium talk, “Data Commons (The open knowledge Repository)”, on 21st July 2021.

Dr. Guha began his talk by explaining analytic models, which are built from mathematical equations. These were used for early AI model building, but they could not be applied to social, behavioural and economic domains because of the complexity of those domains. Next, he discussed empirical models, which do not require causal equations: they take the data and fit a curve to it. Such models became popular because of their many applications, especially in the web ecosystem. While discussing the three pillars of machine learning (algorithms, computation and data), he said that machine learning flourishes with a greater quantity and variety of data, which is available to giants such as Google and Facebook.

Talking about the data landscape, he said that a vast amount of data is available, but one needs to forage for datasets, track down their assumptions, and clean, normalize and join them before they can be used to generate insights. Performing these operations is costly and computationally demanding, which restricts the use of the enormous amount of available data. A platform like Data Commons was therefore conceived, in which knowledge graphs are built from large datasets that have been cleaned and normalized in advance. The speaker said that the Data Commons team has built an open-source infrastructure for creating, storing and serving these knowledge graphs, which are stored in conventional relational stores. He also mentioned that the infrastructure of Data Commons is built on schema.org but has many other features that are not yet in the schema. He added that knowledge graphs based on social and biomedical datasets have already been built, while work on one for energy and climate change is underway.
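The clean, normalize and join steps mentioned in the talk can be illustrated with a small sketch. It is not the Data Commons pipeline or API: the records are made up, and the identifier scheme (`place/...`), the property names (`population`, `areaSqKm`) and the helper functions are all hypothetical, chosen only to show how two inconsistently formatted datasets might be merged into graph-style triples.

```python
# Illustrative sketch only: made-up records, hypothetical IDs and property names.
# This does not reproduce the actual Data Commons infrastructure.

RAW_POPULATION = [
    {"place": " Alpha Town ", "population": "1,200"},
    {"place": "BETA CITY", "population": "45,000"},
]
RAW_AREA = [
    {"city": "alpha town", "area_sq_km": "30"},
    {"city": "Beta City", "area_sq_km": "210"},
]


def place_id(name: str) -> str:
    """Normalize a free-text place name into a shared (hypothetical) identifier."""
    return "place/" + name.strip().lower()


def to_number(text: str) -> float:
    """Clean a numeric string by dropping thousands separators."""
    return float(text.replace(",", ""))


def build_triples(population_rows, area_rows):
    """Join both datasets by emitting (subject, predicate, object) triples
    that share the same normalized subject identifier."""
    triples = []
    for row in population_rows:
        triples.append((place_id(row["place"]), "population", to_number(row["population"])))
    for row in area_rows:
        triples.append((place_id(row["city"]), "areaSqKm", to_number(row["area_sq_km"])))
    return triples


def describe(triples, subject):
    """Collect every property recorded for one subject."""
    return {pred: obj for subj, pred, obj in triples if subj == subject}


TRIPLES = build_triples(RAW_POPULATION, RAW_AREA)
```

Once names are normalized to one identifier, `describe(TRIPLES, "place/alpha town")` returns `{'population': 1200.0, 'areaSqKm': 30.0}`, combining facts that originally lived in separate sources with differently spelled keys.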

During his talk, Dr. Guha demonstrated various features of Data Commons, showed how it can be used to generate insights, and explained how it benefits students, consumers and journalists. He also mentioned that the Data Commons team is working on an open-source natural language interface to all Data Commons data, available for use on DataCommons.org, and that their long-term goal is to answer ambiguous and open-ended questions through Data Commons. The talk was well received by the audience, and the many questions that followed sparked a good discussion on the topic.

The video is available on our YouTube channel: Link.