Data Science with Linear Programming
The standard workflow for data science tasks is to prepare features inside a database, export them as a denormalized data frame, and then apply machine learning algorithms. This workflow is suboptimal for two reasons. First, it requires denormalizing the database, which can turn a small data problem into a big data problem. Second, it assumes that the machine learning algorithm is disentangled from the relational model of the problem. This appears to be a serious limitation, since the relational model encodes valuable domain expertise. In this paper we explore convex optimization, and specifically linear programming, as a data science tool that can express most common machine learning algorithms and can also be integrated natively into a declarative database. We use SolverBlox, a framework that accepts Datalog code as input and feeds it into a linear programming solver. We demonstrate how to express three common machine learning algorithms (Linear Regression, Factorization Machines, and Spectral Clustering) and present use case scenarios in which data processing and the modelling of optimization problems are carried out step by step inside the database.
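To make the idea of expressing a learning algorithm as a linear program concrete, the following is a minimal sketch of linear regression written as an LP. It rests on several assumptions that are not taken from the paper: it uses the least-absolute-deviations (L1) loss, which is one standard way to make regression linear and not necessarily the formulation used here; it calls SciPy's `linprog` rather than SolverBlox and Datalog; and the function name `lad_regression` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def lad_regression(X, y):
    """Fit y ~ X @ w + b by minimizing the sum of absolute residuals via an LP.

    Hypothetical helper for illustration; not the paper's SolverBlox encoding.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape

    # Decision vector: [w (d entries), b (1 entry), t (n entries)].
    # Objective: minimize sum(t), where t_i is an upper bound on |residual_i|.
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])

    ones = np.ones((n, 1))
    eye = np.eye(n)
    # With residual = y - X w - b:
    #    X w + b - t <=  y   encodes  residual >= -t
    #   -X w - b - t <= -y   encodes  residual <=  t
    A_ub = np.vstack([
        np.hstack([X, ones, -eye]),
        np.hstack([-X, -ones, -eye]),
    ])
    b_ub = np.concatenate([y, -y])

    # w and b are free variables; the slack variables t are non-negative.
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:d], res.x[d]


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0
w, b = lad_regression(X, y)
```

In this sketch the feature matrix is still materialized as a dense array, which is exactly the export step the abstract argues against; the point is only to show how the loss and its constraints take a linear form that a solver can consume.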