Data Autonomy and Open Source: a match for transparency
Date: 14 August 2023
Author: Daniël Vos
Introduction
Data autonomy can be understood as the process of regaining control over the information we collect and produce. With that control, we can decide which data goes where and who can see what, which increases transparency around the use of data. A commonly voiced solution for achieving this level of control and transparency is pursuing an organization-wide Open Source strategy. In this blog post, I will explain how Open Source relates to data autonomy and what is required to make it work.
Current situation
At the University of Groningen (UG), we wake up and go to bed with Google. Google services are fully ingrained in the workflow of students, researchers, and staff. This means that all the data we produce in our day-to-day business is created and processed by a private external actor, usually Google. This raises the question of what happens after proprietary software suppliers collect data on (y)our behaviour, content, and other personal/organisational information.
The answer to this question is not straightforward. Proprietary software is usually ‘closed source’, meaning that the human-readable source code is not openly accessible. As an organization, the UG must therefore trust that what companies such as Google promise is actually being done in practice. It is difficult for the UG to see which information is actually collected, how data is processed, where it is stored, and how it might be aggregated or re-used for other business purposes of companies such as Google.
Open Source and data autonomy
To increase transparency and control, Open Source software is often promoted, since the philosophy behind it fully supports those objectives. Open Source software is developed with freedom in mind: it must be freely accessible, modifiable, and distributable. In other words, it gives the user or organisation the agency to influence how the software works, what it will become, which data it collects, and for which purposes.
Having access to the source code and knowing how the software works is one side of the data autonomy puzzle. The data collected by an Open Source solution also needs to be stored somewhere, so some form of storage solution is required (e.g. locally on-premises or in the cloud). With an Open Source solution you are free to pick and choose your storage solution, which is a huge benefit compared to proprietary software, where you are typically required to use the vendor's own cloud: Azure and OneDrive for Microsoft, Google Cloud for Google. However, with more freedom of choice in where to store the data, there also needs to be consensus on what that storage solution should minimally offer (encryption, ease of migration, access management, …). Some form of minimal standard therefore needs to be developed for storage that is compliant with the data autonomy vision of the UG.
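To make this concrete, such a minimal standard could eventually be expressed as a machine-checkable list of required properties. The Python sketch below is purely illustrative: the property names are hypothetical placeholders for the criteria mentioned above (encryption, ease of migration, access management), not an actual UG standard.

```python
# Hypothetical minimal standard for a data-autonomy-compliant storage
# solution. The exact criteria would have to be decided by the UG;
# these names are placeholders for the examples named in the text.
MINIMAL_STANDARD = {
    "encryption_at_rest",
    "encryption_in_transit",
    "ease_of_migration",   # data exportable in open formats
    "access_management",   # fine-grained roles and permissions
}

def is_compliant(offered: set[str]) -> bool:
    """A storage solution complies only if it offers every required feature."""
    return MINIMAL_STANDARD <= offered

def missing_features(offered: set[str]) -> set[str]:
    """Which required features a candidate solution still lacks."""
    return MINIMAL_STANDARD - offered

# Example: a candidate that encrypts everything but makes migration hard.
candidate = {"encryption_at_rest", "encryption_in_transit", "access_management"}
print(is_compliant(candidate))      # False
print(missing_features(candidate))  # {'ease_of_migration'}
```

Expressing the standard as data rather than prose would also make it easy to publish, version, and audit alongside procurement decisions.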
Consequently, we need to develop a measurable metric for data autonomy, which we can use to select and compare different software packages, whether they are available as Open Source or proprietary options. Parties involved in supplying Open Source metrics – such as CHAOSS – have yet to touch on the subject of data autonomy. This could be an opportunity for the UG, as an internationally acknowledged knowledge institution, to step in by developing such a metric.
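As a thought experiment, such a metric could start as a simple weighted score over assessed criteria. The criteria, weights, and scores below are entirely hypothetical; they only illustrate the shape a comparable metric might take:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # relative importance of the criterion
    score: float   # assessed value for a package, between 0 and 1

def autonomy_score(criteria: list[Criterion]) -> float:
    """Weighted average of the assessed criteria, normalised to 0..1."""
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * c.score for c in criteria) / total_weight

# Hypothetical assessment of one software package:
package = [
    Criterion("source code openly auditable", 0.3, 1.0),
    Criterion("free choice of storage backend", 0.3, 0.5),
    Criterion("data exportable in open formats", 0.2, 1.0),
    Criterion("self-hostable without vendor services", 0.2, 0.0),
]
print(round(autonomy_score(package), 2))  # 0.65
```

A real metric would of course need community agreement on the criteria and on how to assess them reproducibly, but even a crude score like this would let the UG rank candidate packages side by side.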
Just talking about wanting to implement Open Source solutions is not enough; to implement them effectively, the UG needs the capacity to maintain and develop the Open Source packages. This first requires a mapping of the current capacity (e.g. which skills we possess and which packages are currently maintained, integrated, or used). If there is a skill gap, the UG can train the necessary skills within the IT department or hire the required developers. Making the current Open Source usage visible could also enable collaborations on similar applications between domains and faculties. If we do not map our current capacity regarding Open Source, we could still be tied down by support contracts with commercialised Open Source platforms, which goes against the data autonomy vision.
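Such a capacity mapping could start as nothing more than a small, shared inventory. The sketch below is a hypothetical illustration of the idea: the package names, faculties, and skills are invented examples, not an actual UG inventory.

```python
from dataclasses import dataclass, field

@dataclass
class PackageEntry:
    name: str
    faculty: str
    status: str  # e.g. "maintained", "integrated", or "used"
    skills_needed: set[str] = field(default_factory=set)

def skill_gap(inventory: list[PackageEntry],
              skills_available: set[str]) -> set[str]:
    """Skills the inventory requires that the organisation does not yet have."""
    needed: set[str] = set()
    for entry in inventory:
        needed |= entry.skills_needed
    return needed - skills_available

# Invented example inventory:
inventory = [
    PackageEntry("Nextcloud", "CIT", "maintained", {"PHP", "sysadmin"}),
    PackageEntry("Moodle", "Arts", "used", {"PHP", "SQL"}),
]
print(sorted(skill_gap(inventory, skills_available={"PHP"})))  # ['SQL', 'sysadmin']
```

Even a minimal inventory like this makes two things visible at once: where training or hiring is needed, and which faculties are running similar applications and could collaborate.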
Conclusion
Being self-sufficient by relying on Open Source is possible. However, it requires considerable investment in re-training and re-thinking how we organize the UG as a datafied and data-dependent organization. This is why the newly established Open Source Program Office (OSPO) at CIT is supporting the data autonomy initiative. Making the UG increasingly ready for Open Source solutions is one of OSPO's key deliverables. On behalf of both the OSPO and the data autonomy project, I therefore invite you to get involved and think along, since we are at the frontier of big changes. Your input will be invaluable to its success!