Aiding polar research, through the eyes of a data scientist

Although polar week is over, #PolarWorld never ends.

“We are drowning in information but starved for knowledge” — John Naisbitt

Immense amounts of data are generated and published by polar researchers. What do we data scientists do with all of it, without ever going to the polar regions?

Read on for some insights shared by Yi-Ming Gan from the Royal Belgian Institute of Natural Sciences.


Data standards facilitate data sharing

Say you need A4-size paper for your printer. Any A4 sheet will fit, regardless of its brand or color, because all A4 paper complies with the A4 size standard, and any printer that takes A4 paper can print on it. The same goes for data standards.

The Darwin Core standard is a data standard established by TDWG that defines a standardised format and terminology for data exchange. All biodiversity data published in the Darwin Core standard can be shared among any databases or software compliant with it. For instance, I am working on the upgrade of the Antarctic biodiversity data portal, which uses the Darwin Core standard, and we are able to import data from the Global Biodiversity Information Facility (GBIF) into our database. This means that anyone visiting our data portal or GBIF can see your work!
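To make this concrete, here is a minimal sketch of what a Darwin Core occurrence record looks like. The column names (occurrenceID, scientificName, decimalLatitude, and so on) are real Darwin Core terms, but the record itself is invented for illustration and is not taken from the portal:

```python
import csv
import io

# A minimal Darwin Core occurrence record: the keys are standard Darwin Core
# terms, so any compliant tool (an IPT, GBIF, ...) can interpret them.
# The values are made up for this example.
record = {
    "occurrenceID": "urn:example:occ:0001",   # hypothetical identifier
    "basisOfRecord": "HumanObservation",
    "scientificName": "Electrona antarctica",
    "decimalLatitude": "-66.5631",            # south of the equator, hence negative
    "decimalLongitude": "140.0010",
    "eventDate": "2010-12-11",                # ISO 8601 avoids date ambiguity
}

# Darwin Core data is typically exchanged as delimited text (e.g. inside a
# Darwin Core Archive); writing the record as CSV mimics that.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```

Because the terms are shared, a receiving system never has to guess what a column called decimalLatitude means.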

Errors in data semantics are common

It is not uncommon to encounter a dataset that lacks a negative sign in a latitude or longitude, has latitude and longitude swapped, or records coordinates in degrees-minutes-seconds instead of decimal degrees.
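The kinds of checks this calls for can be sketched as follows. This is a simplified illustration, not our actual cleaning pipeline:

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to decimal degrees.

    The hemisphere letter supplies the sign: 'S' and 'W' are negative.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value

def plausible_coordinate(lat, lon):
    """Basic range check: latitudes in [-90, 90], longitudes in [-180, 180]."""
    return -90 <= lat <= 90 and -180 <= lon <= 180

# A DMS position in the Southern Ocean, converted to decimal degrees:
lat = dms_to_decimal(66, 33, 47.0, "S")   # roughly -66.563
lon = dms_to_decimal(140, 0, 3.6, "E")    # roughly 140.001
print(lat, lon)

# A swapped latitude/longitude pair often betrays itself by going out of range:
print(plausible_coordinate(140.0, -66.5))  # 140 is not a valid latitude
```

A missing negative sign is harder to catch automatically: +66.5 is a valid latitude, just in the wrong hemisphere, so context (e.g. an Antarctic dataset) is needed to flag it.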

To ensure the accuracy of data deposited in our Integrated Publishing Toolkit (where many Antarctic datasets are deposited), data cleaning is necessary, and that is where we come into play. We use tools such as WoRMS, LifeWatch data services, Quantarctica and others to verify that taxon ranks, coordinates and other components of the datasets are correct. Sometimes I want to give a shout-out to fellow researchers who provide high-quality research data, because they make our lives so much easier! Kudos!!
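As a sketch of how a taxon name can be checked programmatically: WoRMS exposes a public REST API, and a lookup request against its AphiaRecordsByName service can be built like this (the HTTP call itself and error handling are omitted, and this is an illustration rather than our exact workflow):

```python
from urllib.parse import quote

# Base of the public WoRMS REST API (see marinespecies.org/rest).
WORMS_REST = "https://www.marinespecies.org/rest"

def worms_lookup_url(scientific_name):
    """Build the URL that returns Aphia records matching a scientific name."""
    return f"{WORMS_REST}/AphiaRecordsByName/{quote(scientific_name)}"

# An Antarctic lanternfish, as an example query:
url = worms_lookup_url("Electrona antarctica")
print(url)
```

Fetching that URL returns JSON records with the accepted name, taxonomic rank and classification, which can then be compared against what the dataset claims.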


In short, the Integrated Publishing Toolkit (IPT) is free and open-source software for publishing and sharing biodiversity datasets through the GBIF network. There are many IPTs out there, in many different countries.


Not unexpectedly, no data cleaning procedure is perfect. Sometimes it is simply impossible to verify the semantics of the data. For instance, we could never verify whether the date 11/12/10 is in year-month-day, month-day-year or day-month-year format without consulting the authors of the dataset.
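The ambiguity is easy to demonstrate: all three common orderings parse that string successfully, so nothing in the value itself can settle which date was meant.

```python
from datetime import datetime

AMBIGUOUS = "11/12/10"

# Three common interpretations of a two-digit-field date:
candidates = {
    "year-month-day": "%y/%m/%d",
    "month-day-year": "%m/%d/%y",
    "day-month-year": "%d/%m/%y",
}

# Every format parses without error, each yielding a different date.
valid = {}
for label, fmt in candidates.items():
    try:
        valid[label] = datetime.strptime(AMBIGUOUS, fmt).date().isoformat()
    except ValueError:
        pass

for label, iso in valid.items():
    print(f"{label}: {iso}")
```

This is why unambiguous formats such as ISO 8601 (YYYY-MM-DD) are strongly preferred for data exchange.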


“You can have data without information but you cannot have information without data.”

–Daniel Keys Moran


Free and open access to biodiversity data, open source software

We are currently in the phase of providing free and open access to Antarctic biodiversity data for our project, the Antarctic Biodiversity Information System. Besides upgrading our data portal, we are also working with taxonomy experts on the Register of Antarctic Species to provide comprehensive information on Antarctic species.

We strive to provide a better user experience in web application design and better ways to explore datasets, so that more people will learn about your work.

We are also developing tools and software that users can use to analyze their data. What's more, the applications will be made open source once they are in production!


“For me, open source is a moral thing.” — Matt Mullenweg


Data scientists are supporting polar research in the background

Although you don’t see us in the polar regions, we are actively maintaining your data, web applications and databases.

If you are visiting Antarctica soon, why not check out the Antarctic Field Guides?

If you have metadata for microbial sequence data, you are welcome to deposit it in the Microbial Antarctic Resource System (MARS).