Generalized Machine Learning – Kernelml – Simple ML to train Complex ML



A lot of recent articles discuss a form of optimization called ‘particle optimization’, or ‘swarm optimization’ when multiple particles are used. Coincidentally, I recently created a ‘particle optimizer’ and published it as a pip-installable Python package called kernelml. My goal is to eventually make the project open source. This optimizer can be used as a generalized machine learning algorithm for custom loss functions and non-linear coefficients.
Example use case:
Let’s take the problem of clustering longitude and latitude coordinates. Clustering methods such as K-means use Euclidean distances to compare observations. However, the Euclidean distances between longitude and latitude data points do not map directly to Haversine distances. That means that if you normalize the coordinates between 0 and 1, the distances won’t be accurately represented in the clustering model. A possible solution is to find a projection for latitude and longitude so that the Haversine distance to the centroid of the data points is equal to the Euclidean distance to the centroid in the projected space.

 
The result of this coordinate transformation allows you to represent the Haversine distance, relative to the center, as a Euclidean distance, which can be scaled and used in a cluster solution.
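To make the mismatch concrete, here is a minimal sketch of the Haversine distance in NumPy. The function name, the Earth radius constant, and the sample coordinates are illustrative assumptions and are not taken from kernelml. It compares one degree of longitude at the equator with one degree at 60 degrees north: the difference in raw coordinates is the same, but the distance on the ground is roughly halved.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Same one-degree step in longitude, taken at two different latitudes
d_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
d_north = haversine_km(60.0, 0.0, 60.0, 1.0)
print(d_equator, d_north)
```

A Euclidean distance computed on the raw (or min-max normalized) coordinates would treat both steps as equal, which is why a projection is needed before clustering.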
Another, simpler problem is to find the optimal values of non-linear coefficients, i.e., power transformations in a least squares linear model. The reason for doing this is simple: integer power transformations rarely capture the best-fitting transformation. By allowing the power to be any real number, the fit can improve and the model may generalize better to validation data.

 To clarify what is meant by a power transformation, the formula for the model is provided above.
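As a rough illustration of why a real-valued power can help, the sketch below assumes a model of the form y = b0 + b1 * x**p (this form, the synthetic data, and the grid of candidate powers are assumptions made for the example). For each candidate p it solves the linear coefficients by least squares and keeps the p with the lowest error. A plain grid search stands in for the optimizer here; it is not how kernelml searches.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=200)
true_p = 1.7  # hypothetical non-integer exponent used to generate the data
y = 2.0 + 3.0 * x ** true_p + rng.normal(0.0, 0.1, size=x.size)

def sse_for_power(p):
    """Fit y = b0 + b1 * x**p by least squares; return the sum of squared errors."""
    X = np.column_stack([np.ones_like(x), x ** p])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

# Search real-valued powers instead of only the integers 1, 2, 3
powers = np.linspace(0.5, 3.0, 251)
errors = np.array([sse_for_power(p) for p in powers])
best_p = powers[np.argmin(errors)]
print(best_p, errors.min(), sse_for_power(1.0), sse_for_power(2.0))
```

On this synthetic data, the integer powers 1 and 2 both give a noticeably larger error than the best real-valued power.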
The algorithm:
The idea behind kernelml is simple: use the parameter update history in a machine learning model to decide how to update the next parameter set. Using a machine learning model in the backend causes a bias-variance problem. Specifically, the parameter updates become more biased with each iteration. The problem can be solved by including a Monte Carlo simulation around the best recorded parameter set after each iteration.
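The loop below is a toy sketch of that idea, not kernelml's actual implementation. The loss function, step size, and sample count are hypothetical. Each iteration draws random parameter sets around the best one (the Monte Carlo step), fits a linear least squares model of loss against those recent parameter sets (a simple stand-in for learning from the update history), and uses the fitted slope to propose one extra update.

```python
import numpy as np

def loss(w):
    """Toy user-defined loss with its minimum at (1, -2)."""
    return float(np.sum((w - np.array([1.0, -2.0])) ** 2))

rng = np.random.default_rng(0)
best_w = np.zeros(2)
best_loss = loss(best_w)
step = 0.5       # hypothetical sampling scale and step size
n_samples = 20   # Monte Carlo draws per iteration

for iteration in range(200):
    # Monte Carlo simulation around the best recorded parameter set
    candidates = best_w + rng.normal(0.0, step, size=(n_samples, 2))
    losses = np.array([loss(w) for w in candidates])

    # Fit loss ~ parameters with linear least squares and step against the slope
    X = np.column_stack([np.ones(n_samples), candidates])
    coef, *_ = np.linalg.lstsq(X, losses, rcond=None)
    slope = coef[1:]
    norm = np.linalg.norm(slope)
    if norm > 0:
        proposal = best_w - step * slope / norm
        candidates = np.vstack([candidates, proposal])
        losses = np.append(losses, loss(proposal))

    # Keep the best parameter set seen so far
    i = int(np.argmin(losses))
    if losses[i] < best_loss:
        best_w, best_loss = candidates[i], float(losses[i])

print(best_w, best_loss)
```

The random draws keep the search from relying only on the fitted model, which is the role the Monte Carlo simulation plays in the description above.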
The issue of convergence:
The model saves the best parameter set and the user-defined loss after each iteration. The model also records a history of all parameter updates. The question is how to use this data to define convergence. One possible solution is:
         convergence = (best_parameter - np.mean(param_by_iter[-10:, :], axis=0)) / np.std(param_by_iter[-10:, :], axis=0)
         if np.all(np.abs(convergence) < 1):
             print('converged')
             break
The formula creates a Z-score using the last 10 parameter sets and the best parameter set. If the absolute Z-score for every parameter is less than 1, the algorithm can be said to have converged. This convergence solution works well when there is a theoretical best parameter set. It becomes a problem when the algorithm is used for clustering. See the example below.
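For experimenting outside the training loop, the convergence rule above can be wrapped in a small function. The function name, the `window` and `threshold` arguments, and the handling of zero spread are assumptions added for this sketch. The history here is a made-up array that oscillates tightly around (1, -2).

```python
import numpy as np

def has_converged(best_parameter, param_by_iter, window=10, threshold=1.0):
    """Z-score of the best parameters against the last `window` updates."""
    recent = np.asarray(param_by_iter, dtype=float)[-window:, :]
    diff = np.asarray(best_parameter, dtype=float) - recent.mean(axis=0)
    std = recent.std(axis=0)
    # Assumption: with zero spread, a parameter counts as converged only if
    # the best value equals the recent mean exactly
    z = np.full_like(diff, np.inf, dtype=float)
    np.divide(diff, std, out=z, where=std > 0)
    z[(std == 0) & (diff == 0)] = 0.0
    return bool(np.all(np.abs(z) < threshold))

signs = np.array([1, -1] * 5)[:, None]          # +1, -1, +1, ...
history = np.array([1.0, -2.0]) + 0.01 * signs  # 10 updates around (1, -2)

near = has_converged(np.array([1.0, -2.0]), history)
far = has_converged(np.array([1.5, -2.0]), history)
print(near, far)
```

The first call reports convergence because the best parameters sit at the center of the recent updates. The second does not, because the first parameter is many standard deviations away from them.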
 Figure 1: Clustering with kernelml, 2-D multivariate normal distribution (blue), cluster solution (other colors)
We won’t get into the quality of the cluster solution because it is clearly not representative of the data. The cluster solution minimized the difference between a multidimensional histogram and the average probability of 6 normal distributions, 3 for each axis. Here, the distributions can ‘trade’ data points fairly easily, which could increase convergence time. Why not just fit 3 multivariate normal distributions? Simulating the distribution parameters is difficult because some parameters have constraints. The covariance matrix needs to be positive semi-definite, and its inverse needs to exist. The standard deviation of a normal distribution must be greater than 0. The solution used in this model incorporates the parameter constraints by making a custom simulation for each individual parameter. I have not yet found a good formulation for simulating the covariance matrix of a multivariate distribution.
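As a sketch of what constraint-aware simulation can look like, the snippet below perturbs a standard deviation multiplicatively so that every draw stays positive. It also shows one common parameterization for covariance matrices: perturbing the Cholesky factor and rebuilding the matrix. Neither function is taken from kernelml. The names, scales, and example matrix are assumptions, and whether the Cholesky approach suits this optimizer is untested.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_std(best_std, scale=0.1, size=1000):
    """Perturb a standard deviation in log space so every draw stays > 0."""
    return best_std * np.exp(rng.normal(0.0, scale, size=size))

def simulate_covariance(best_cov, scale=0.1):
    """Perturb the Cholesky factor of a covariance matrix and rebuild it."""
    L = np.linalg.cholesky(best_cov)
    L_new = L + np.tril(rng.normal(0.0, scale, size=L.shape))
    # A strictly positive diagonal makes L_new @ L_new.T positive definite
    d = np.diag_indices_from(L_new)
    L_new[d] = np.abs(L_new[d]) + 1e-8
    return L_new @ L_new.T

stds = simulate_std(2.0)
cov = simulate_covariance(np.array([[1.0, 0.3], [0.3, 2.0]]))
eigenvalues = np.linalg.eigvalsh(cov)
print(stds.min(), eigenvalues)
```

Because the rebuilt matrix is symmetric with strictly positive eigenvalues, it is invertible, so it satisfies both constraints mentioned above.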
The code for the clustering example, other use cases, and documentation (still in progress) can be found on GitHub.

