Machine Learning Pipeline for Biochemistry
Our client is a major international pharmaceutical company, which conducts research and development activities related to a wide range of human medical disorders, including mental illness, neurological disorders, anaesthesia and analgesia, gastrointestinal disorders, fungal infection, allergies, and cancer.
Deliver a very fast and highly scalable calculation pipeline that uses different machine learning algorithms to learn and predict chemical compound activity, reducing the number of real experiments required. The solution should handle different types and formats of both input and output data through an additional abstraction layer that suits the needs of our customer and the other scientific groups involved in the development process.
The pipeline should be cross-platform and run at least under Windows and Linux. It should be possible to distribute complex calculations at both the CPU-core and the cluster level. An additional challenge came from the cooperation between the main development team and the scientific teams, including two universities located in Belgium and Austria. This required the ability to work as one team in a multi-language environment including C, Java, Python and R, and the skill to integrate all parts into the pipeline.
C++ was chosen as the main development language. The choice was obvious because C++ allowed us to develop high-performance code and was the most common language in the scientific community. At the same time, we integrated into the pipeline other solutions that were originally written in C, Java, Python and R. To suit the needs of different scientific groups, we plan to use SWIG to allow interfacing with the pipeline in the future.
The whole pipeline was written in C++. The only exceptions were a Java service used for fingerprint calculation and an R molecule similarity package that served as an interfacing library for the corresponding kernel similarity metrics implemented in C++. The Java service relied on the jCompoundMapper library for fingerprinting chemical compounds. This library had a number of minor bugs, which required us to patch the original version.
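To illustrate what a kernel similarity metric over fingerprints can look like, here is a minimal sketch of the Tanimoto coefficient on bit fingerprints. The text does not say which metrics the pipeline implemented, so the choice of Tanimoto, the Fingerprint type and the function name are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fingerprint type: one bool per fingerprint bit.
using Fingerprint = std::vector<bool>;

// Tanimoto similarity: (bits set in both) / (bits set in either).
// Returns 1.0 when neither fingerprint has a bit set (a convention assumed here).
double tanimoto(const Fingerprint& a, const Fingerprint& b) {
    assert(a.size() == b.size());  // fingerprints must have the same length
    std::size_t both = 0;
    std::size_t either = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] && b[i]) ++both;
        if (a[i] || b[i]) ++either;
    }
    if (either == 0) return 1.0;
    return static_cast<double>(both) / static_cast<double>(either);
}
```

Two fingerprints that share one set bit out of three set in total would score 1/3 under this definition, and a fingerprint compared with itself scores 1.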
Molecule similarity methods rely heavily on chemistry libraries. We used OEChem TK, a professional C++ toolkit provided by OpenEye that is also available in Python. Because this library requires a license, we developed an additional abstraction layer to allow interfacing with other chemistry libraries. We planned to include the RDKit library, which also has C++ and Python versions.
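An abstraction layer of this kind can be sketched as an interface that pipeline code depends on, with one implementation per chemistry toolkit. Everything below is hypothetical: the class names, the methods and the stub backend are illustrative and do not reproduce the OEChem TK or RDKit APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface: the operations the pipeline needs from any chemistry toolkit.
class ChemistryBackend {
public:
    virtual ~ChemistryBackend() = default;
    virtual std::string name() const = 0;
    // Returns true if the backend accepts the molecule description.
    virtual bool accepts(const std::string& molecule) const = 0;
};

// Placeholder backend for illustration; a real one would wrap toolkit calls.
class StubBackend : public ChemistryBackend {
public:
    std::string name() const override { return "stub"; }
    bool accepts(const std::string& molecule) const override { return !molecule.empty(); }
};

// Factory: returns nullptr for unknown names. Other backends would be added here.
std::unique_ptr<ChemistryBackend> makeBackend(const std::string& name) {
    if (name == "stub") return std::unique_ptr<ChemistryBackend>(new StubBackend());
    return nullptr;
}

// Pipeline code sees only the interface, never a concrete toolkit.
class PipelineStage {
public:
    explicit PipelineStage(std::unique_ptr<ChemistryBackend> backend)
        : backend_(std::move(backend)) {
        assert(backend_ != nullptr);  // a stage cannot run without a backend
    }
    std::string backendName() const { return backend_->name(); }
    std::size_t countAccepted(const std::vector<std::string>& molecules) const {
        std::size_t n = 0;
        for (const auto& m : molecules) {
            if (backend_->accepts(m)) ++n;
        }
        return n;
    }
private:
    std::unique_ptr<ChemistryBackend> backend_;
};
```

With this shape, swapping a licensed toolkit for an open one would only require a new class derived from the interface and one more branch in the factory.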
To allow compilation on different platforms, we used CMake. Compared with the alternatives, it was highly automated, configurable and easy to use, allowing us to generate makefiles for the required platforms. Our pipeline relied on the STL and the Boost C++ libraries, which was an obvious choice. Boost allowed us to write cross-platform code and concentrate on core features during development, which made the code more robust.
To achieve high performance and distribute complex calculations, we used the Threading Building Blocks (TBB) C++ library developed by Intel. Compared with the alternatives we considered, TBB offered the best performance; it was cross-platform and compiler-independent, and its components could be used separately. It also had an intuitive API and good documentation, which was very important. Additionally, the pipeline could distribute calculations over a cluster, because its parts could run independently, each consuming the settings for a particular node specified in JSON format.
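One way independent parts can be assigned to cluster nodes is to give each node a contiguous slice of the work items, derived from its own settings. The sketch below assumes two hypothetical settings fields, nodeIndex and nodeCount, which would come from the node's JSON file; the actual settings schema and splitting strategy are not described in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical per-node settings; field names are assumptions, not the real schema.
struct NodeSettings {
    std::size_t nodeIndex;  // zero-based index of this node
    std::size_t nodeCount;  // total number of nodes in the cluster
};

// Returns the half-open range [first, last) of work items assigned to this node.
// Items are split into contiguous blocks whose sizes differ by at most one.
std::pair<std::size_t, std::size_t> nodeRange(const NodeSettings& s, std::size_t itemCount) {
    assert(s.nodeCount > 0 && s.nodeIndex < s.nodeCount);
    const std::size_t base = itemCount / s.nodeCount;
    const std::size_t extra = itemCount % s.nodeCount;  // first `extra` nodes get one more item
    const std::size_t first = s.nodeIndex * base + std::min(s.nodeIndex, extra);
    const std::size_t size = base + (s.nodeIndex < extra ? 1 : 0);
    return std::make_pair(first, first + size);
}
```

Because each node computes its slice from its own settings alone, no coordination between nodes is needed at start-up; within a node, the slice can then be processed in parallel across CPU cores.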
Input and output data came in different types and formats. To handle large amounts of text-based and binary data, we used cross-platform compression libraries including zlib and bzip2. The typical way of sharing data in the scientific community is the HDF file format. It is a binary format designed for large datasets, and it supports compression. We used the HDF C++ library provided by the HDF Group, which also has implementations for R and Python. Unfortunately, this library was not multi-threaded, and its thread-safety was not stable and relied on the pthread library, which was not available on all platforms. We had to introduce an additional thread-safe wrapper layer that supported multi-threaded use.
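A thread-safe wrapper layer over a library that is not thread-safe can be sketched as a class that owns the resource and serializes every access through a single mutex. The names below are hypothetical, and FakeDataset is a stand-in for a file handle rather than any part of the HDF API; the sketch uses std::mutex from the C++ standard library, which may differ from what the pipeline used.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Owns a non-thread-safe resource and serializes all access through one mutex.
template <typename Resource>
class Synchronized {
public:
    explicit Synchronized(Resource resource = Resource()) : resource_(std::move(resource)) {}

    // Runs `fn` with exclusive access to the resource and returns its result.
    template <typename Fn>
    auto withLock(Fn&& fn) -> decltype(fn(std::declval<Resource&>())) {
        std::lock_guard<std::mutex> guard(mutex_);
        return fn(resource_);
    }

private:
    std::mutex mutex_;
    Resource resource_;
};

// Stand-in for a non-thread-safe dataset handle (illustrative only).
struct FakeDataset {
    std::vector<double> values;
    void append(double v) { values.push_back(v); }
    double at(std::size_t i) const {
        assert(i < values.size());  // index must be in range
        return values[i];
    }
};
```

Callers never touch the resource directly; they pass a lambda to withLock, so two threads can never be inside the wrapped library at the same time. The trade-off is that access is fully serialized, which favours correctness over parallel I/O throughput.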
Another option for handling large amounts of data was Redis, a very simple, high-performance key-value store that allowed us to avoid the complexity of traditional relational databases. Redis clients were available for several languages, including C++, Python and R. We used the official hiredis C library. Unfortunately, the Windows version of this library had certain limitations, which we overcame by improving it.
Classical machine learning algorithms were implemented within the pipeline. We also used additional libraries to extend the set of available algorithms. One of them was libsvm, a Support Vector Machine library, which we adapted to the pipeline with a variety of fixes and improvements. Another library that we were planning to integrate was the MultiBoost learner library.
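Adapting an external learner usually involves converting the pipeline's data into the learner's input representation. As an illustration, the sketch below parses one line of the sparse text format that libsvm's tools read, "label index:value index:value ...". The struct and function names are hypothetical, and this is not a description of how the pipeline actually exchanged data with libsvm.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One training sample in sparse form (hypothetical type for illustration).
struct SparseSample {
    double label = 0.0;
    std::vector<std::pair<int, double>> features;  // (index, value) pairs
};

// Parses one line "label index:value index:value ...".
// Returns false if the label or any index:value token cannot be parsed.
bool parseSparseLine(const std::string& line, SparseSample& out) {
    std::istringstream in(line);
    SparseSample sample;
    if (!(in >> sample.label)) return false;
    std::string token;
    while (in >> token) {
        const std::size_t colon = token.find(':');
        if (colon == std::string::npos) return false;
        try {
            const int index = std::stoi(token.substr(0, colon));
            const double value = std::stod(token.substr(colon + 1));
            sample.features.emplace_back(index, value);
        } catch (const std::exception&) {
            return false;  // non-numeric index or value
        }
    }
    out = sample;
    return true;
}
```

Only features with non-zero values need to appear on a line, which keeps files compact for sparse inputs such as chemical fingerprints.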
Automatic testing of the different algorithms was an important part of the development process. We used the Boost.Test framework because it is part of the Boost C++ libraries and integrates easily with CMake.
There were also other post-processing steps written in R, Java and Python and developed by other scientific groups. In the future we are planning to allow interfacing with the pipeline by using SWIG.
Our solution made it possible to handle large amounts of data provided in the different types and formats that are common in the scientific community. The pipeline utilised different machine learning algorithms and allowed complex calculations to be distributed over multiple cores or even a cluster.
Using C++ as the main development language together with CMake allowed our solution to be cross-platform. All libraries were chosen so that we could reduce development time and, in the future, introduce an additional interfacing layer for languages such as R and Python, which is very important for the scientific community.