TY - JOUR
T1 - Synthesizer
T2 - Expediting synthesis studies from context-free data with information retrieval techniques
AU - Gandy, Lisa M.
AU - Gumm, Jordan
AU - Fertig, Benjamin
AU - Thessen, Anne
AU - Kennish, Michael J.
AU - Chavan, Sameer
AU - Marchionni, Luigi
AU - Xia, Xiaoxin
AU - Shankrit, Shambhavi
AU - Fertig, Elana J.
N1 - Publisher Copyright: © 2017 Gandy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2017/4
Y1 - 2017/4
AB - Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information-retrieval-inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Applied to cancer and ecological datasets, the Synthesize algorithm achieved high accuracy (on the order of 85-100%). We further implement Synthesize in an open-source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma-separated values (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy-to-use graphical user interface, which enables the user to finish combining data and obtain perfect accuracy. Future work will allow detection of units to automatically merge continuous data and application of the algorithm to other data formats, including databases.
UR - http://www.scopus.com/inward/record.url?scp=85018585958&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85018585958&partnerID=8YFLogxK
U2 - https://doi.org/10.1371/journal.pone.0175860
DO - https://doi.org/10.1371/journal.pone.0175860
M3 - Article
C2 - 28437440
SN - 1932-6203
VL - 12
JO - PLOS ONE
JF - PLOS ONE
IS - 4
M1 - e0175860
ER -