Google tackles "messy data" with new tool

12th Nov 2010
Google has rebranded the Freebase Gridworks software it acquired as part of its Metaweb purchase, with the newly launched version of the open source data cleaning project now dubbed 'Google Refine'.

The vendor purchased Metaweb in July with the aim of integrating its Freebase database of information about 12 million real-world entities such as movies, celebrities and locations into its online search engine in order to improve the quality and accuracy of such searches.

But the Freebase Gridworks/Refine open source project, which was set up by Freebase developers David Huynh and Stefano Mazzocchi, came along as part of the package too.

Google describes it as a "power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format to another and extending them with new data from external web services or other databases".

In other words, the Java-based software enables users to identify and eradicate inconsistencies such as different spellings of ‘grey’ in a given dataset in order to make it easier to undertake data aggregation and analysis.
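To give a rough idea of what this kind of clean-up involves, the following Python sketch groups variant spellings under a normalised key and replaces each with the most common form in its group. It is an illustration only and is not taken from Refine itself; the function names (`fingerprint`, `cluster_values`, `canonicalize`) and the sample values are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Build a normalised key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster_values(values):
    """Group raw values that share the same fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return groups


def canonicalize(values):
    """Replace every value with the most frequent variant in its cluster."""
    mapping = {}
    for variants in cluster_values(values).values():
        canonical = Counter(variants).most_common(1)[0][0]
        for variant in variants:
            mapping[variant] = canonical
    return [mapping[value] for value in values]


# Hypothetical sample column with inconsistent entries.
colours = ["grey", "Grey", "grey ", "GREY.", "grey",
           "dark blue", "Blue, dark", "dark blue"]
cleaned = canonicalize(colours)
```

A key of this kind only catches differences in case, spacing, punctuation and word order. Merging genuinely different spellings, such as 'grey' and 'gray', would need an extra step such as fuzzy matching or a lookup table of known variants.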

In the past, such a task would have meant writing specific software code for each data set. Gridworks/Refine, by contrast, does not require any coding and can be used with any data set, which should save users time and effort.

The technology is already being used by the Chicago Tribune and the online news service ProPublica.

Refine version 2.0, meanwhile, includes such new functionality as an "extensions architecture", a reconciliation framework for linking records to other databases such as Freebase and a range of new transformation commands and expressions.
