How to “clean” and analyse data using Exploratory, OpenRefine, RStudio, & Jupyter
(!Work in progress, perpetual alpha … use these tips at your own risk)
What about it?
I am going to write a series about my data analysis and writing workflow. These workflows use free and open source tools to clean, analyse, and write research, and I want them to be reproducible.
I work with CSV data sets, and I use OpenRefine to “clean” them. I use Exploratory for exploratory analysis of the data (with some modelling), and R for data analysis. Stata is the only other non-free software I use. I also use Pandas/Statsmodels/Matplotlib in Python, within a Jupyter Notebook, for data analysis and graphical visualisation. I keep my bibliographic database in BibTeX. I write the paper in Jupyter using Markdown, and for final production I convert it to either Markdown or LaTeX using Pandoc. I use Overleaf and Authorea for finishing.
I work with colleagues who only use Microsoft Word, EndNote, and SPSS for data analysis. I have found that converting the files to Word documents and sharing them over Word Online works best. I also like to keep my papers on pre-print servers. Both Overleaf and Authorea now support piping your papers to preprint servers (my favourite is PeerJ). You can mint DOIs using Figshare, and keep your data there. I keep online spreadsheets using Airtable.
Can Medium be a medium for Scholarly Writing?
I am going to write about these workflows in the following series of posts on data analysis and writing. I would like Medium to be a channel for scholarly writing. At the time of writing this series, Medium allows embedding of spreadsheets (as Airtable embeds) and shows graphics (including interactive graphics using Plot.ly), but tables must otherwise be presented as images. You can use endnote/footnote styles (superscripts) to write references (typing out the references yourself), but beyond this, Medium does not allow minting of DOIs, so using it as a portal for your scholarly publication or preprint is not possible at the moment. Otherwise, Medium would be an ideal platform for scholarly communication and peer review, as it allows for different levels of commenting and feedback.
Start with OpenRefine
Start by saving the data as a CSV file. If the data arrive as an Excel spreadsheet, open them in your favourite software, export them as CSV, and then read that file into Exploratory. Exploratory works well with columns: you can delete unwanted columns or rename them.
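Outside the GUI tools, the same delete-and-rename steps can be sketched in pandas. The column names and data below are invented for illustration, and the CSV lives in memory (via `io.StringIO`) so the example is self-contained; in practice you would pass a file path to `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for a CSV file exported from a spreadsheet
# (hypothetical columns, made up for this sketch).
raw_csv = io.StringIO(
    "id,respondent name,age,notes\n"
    "1,Alice,34,ok\n"
    "2,Bob,29,check\n"
)

df = pd.read_csv(raw_csv)

# Delete an unwanted column.
df = df.drop(columns=["notes"])

# Rename a column to something tidier.
df = df.rename(columns={"respondent name": "name"})

print(df.columns.tolist())  # ['id', 'name', 'age']
```

Both `drop` and `rename` return a new DataFrame rather than modifying the original, which keeps each cleaning step explicit and easy to re-run.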
Data Preprocessing: OpenRefine
Remove rows and columns
Removing rows and columns that contain irrelevant data is usually the first couple of steps in data preprocessing. As OpenRefine works on columns, you can open a column’s drop-down menu and choose to remove the column.
If you want to remove one or more rows:
- Star it (or star each of the rows you want removed)
- From the “All” column, select “Facet by star”
- In the resulting facet window, you see “true” with a count (“1” if only one row is starred, otherwise n, where n is the number of rows you want removed)
- Click “true” to show only the starred rows, then in the “All” column click “Remove all matching rows”
- This removes all the “starred” rows; if you then click the “false” link, you will see the remaining rows in the right-hand panel (see Figure 1)
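For comparison, the star-and-remove steps above can be sketched in pandas, where a boolean mask plays the role of the star and keeping the unmatched rows is the equivalent of “Remove all matching rows”. The data and the removal condition here are invented for illustration.

```python
import pandas as pd

# Hypothetical data set for this sketch.
df = pd.DataFrame({"id": [1, 2, 3, 4], "value": [10, -1, 30, -5]})

# "Star" the rows to remove -- here, rows with negative values.
starred = df["value"] < 0

# Equivalent of "Remove all matching rows": keep only the unstarred rows.
df = df[~starred].reset_index(drop=True)

print(df["id"].tolist())  # [1, 3]
```

Where OpenRefine records each of these steps in its undo/redo history, in pandas the script itself is the record, which is one way to keep the cleaning reproducible.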