Designing an enterprise solution architecture for ML / AI / GenAI use cases

Introduction

Designing a fit-for-purpose solution architecture at an enterprise requires consideration of the following three elements:

  1. Requirements specific to the new use case (ML / AI / GenAI)
  2. The existing IT architecture landscape
  3. New candidate platforms and infrastructure that could help fill the gaps

This short article highlights the importance of identifying the intersection of these three elements, so that one ends up with a solution architecture that meets the needs of the new use case, reuses as many components as possible from the existing IT architecture landscape, and fills the remaining gaps with the tools, platforms or infrastructure that must be brought in to arrive at a viable, working solution architecture.

Read More

Talk/Demo: Big Geospatial Data with Open-source Tech (Vectors, Rasters & Map-matching)

Summary

Geospatial datasets (i.e., geocoded data points) are everywhere nowadays and often add enormous value to data analytics/mining and machine learning projects. In this new era of Big Data, libraries and engines such as GeoPandas, PostGIS and their equivalents in the commercial space often fall short and cannot scale up sufficiently to let us tap into the Big Data being collected in many use cases and by many organizations. In this talk/demo, we explore free, open-source, Big Data-ready technologies and workflows such as GeoMesa, GeoPySpark and OSRM-on-Spark, and show how to use these Apache Spark-based technologies and workflows for key geospatial operations and use cases. We start by introducing GeoMesa and demoing how it can be used to ingest Big Geospatial Data and perform operations on vectors. Next, we briefly introduce GeoPySpark, the Python interface to GeoTrellis, for performing operations on rasters. Finally, we turn to map-matching, the process of matching geocoded data points to elements of an underlying network (e.g., determining which street a particular GPS point should be associated with). We describe and demo how we can combine OSRM with Spark to do scalable map-matching on Big Data, which opens up many possibilities for advanced data mining and machine learning projects.
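In the talk, map-matching proper is done by OSRM, which runs a principled (HMM-based) matcher over a real road network, with Spark distributing the calls. Purely to illustrate the underlying idea, here is a minimal, hypothetical nearest-segment snap in plain Python; the street names and coordinates below are made up:

```python
import math

def point_segment_dist(p, a, b):
    """Distance from point p to segment a-b (planar approximation)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamping to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

# Hypothetical street network: name -> segment endpoints.
streets = {
    "Main St": ((0.0, 0.0), (10.0, 0.0)),
    "First Ave": ((0.0, 0.0), (0.0, 10.0)),
}

def match_point(p):
    """Snap a GPS point to the nearest street segment."""
    return min(streets, key=lambda name: point_segment_dist(p, *streets[name]))

print(match_point((5.0, 0.3)))  # Main St
print(match_point((0.2, 7.0)))  # First Ave
```

A production workflow instead sends each partition of GPS traces to OSRM's matching service from Spark workers, which also handles noise and route continuity, not just point-wise distance.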

Slides

Video

Talk/Demo: Large-scale Experimentation with Spark & Productionizing Native Spark ML Models

Summary

Apache Spark is a state-of-the-art distributed processing, analytics and ML engine, and we present and demo two interesting ways to use Spark in ML projects: 1) we use Spark to distribute the grid-search optimization of a generic ML model (from a regular, single-machine ML library). We show how Spark can distribute processing tasks over the CPU cores of a cluster, which gives a near-linear speedup and lowers processing times; hence it facilitates the exploration of a much larger space to find the optimal hyperparameters for the ML model. This use case is suitable when projects do not involve Big Data and we use Big Data technologies, i.e., Spark, purely to speed up the processing of tasks; 2) we demonstrate how to train an example model using Spark's own MLlib and how to serve the model with MLeap, a production-quality, low-latency serving engine. This second use case/workflow is suitable when projects do involve Big Data.
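The first workflow, distributing a hyperparameter grid over workers, can be sketched with the standard library. The toy `evaluate` function below stands in for "train a model and return its validation error"; on a cluster, Spark would express the same pattern as `sc.parallelize(grid).map(evaluate).collect()`:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    """Toy stand-in for training a model and returning its validation error;
    a real run would fit a single-machine ML model here."""
    lr, reg = params
    error = (lr - 0.1) ** 2 + (reg - 0.01) ** 2
    return error, params

# Hyperparameter grid: learning rate x regularization strength.
grid = list(product([0.01, 0.1, 1.0], [0.001, 0.01, 0.1]))

# Spark would distribute these map tasks over the cluster's CPU cores;
# a thread pool shows the same map-then-collect pattern locally.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

best_error, best_params = min(results)
print(best_params)  # (0.1, 0.01)
```

The near-linear speedup comes from the tasks being independent: each grid point trains in isolation, so adding cores (or cluster workers) shrinks wall-clock time almost proportionally.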

Slides

Video

Talk/Demo: Supercharging Analytics with GPUs: OmniSci/cuDF vs Postgres/Pandas/PDAL

Summary

GPUs are known to significantly accelerate machine learning model training, especially when using deep learning libraries like TensorFlow. But did you know that there are now solid options to also accelerate data analytics workloads, BI tools and dashboards with the help of GPUs? Join us for a presentation of performance benchmarks of GPU-based options against their CPU-based counterparts. We compare the performance of OmniSci Core DB (a GPU database) with that of Postgres (for data analytics) and PDAL (for LiDAR processing). On the in-memory side, we benchmark cuDF (NVIDIA’s GPU DataFrame) against the widely popular Pandas DataFrame. We will share results and include code walk-throughs and live benchmarking. Coming out of this technical talk, you will have insight into how GPUs can accelerate your data analytics and geospatial workloads.
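The benchmarking methodology itself is library-agnostic. As a simplified sketch, the harness below shows the usual shape (a warm-up run, then repeated timings with the median reported); in a CPU-vs-GPU comparison, the same harness would wrap, say, a Pandas groupby on one side and the equivalent cuDF call on the other. The pure-Python workload here is only a placeholder:

```python
import random
import statistics
import time
from collections import defaultdict

random.seed(0)
keys = [random.randrange(100) for _ in range(100_000)]
vals = [random.random() for _ in range(100_000)]

def groupby_sum():
    """Placeholder workload: group values by key and sum them."""
    acc = defaultdict(float)
    for k, v in zip(keys, vals):
        acc[k] += v
    return acc

def bench(fn, repeats=5):
    """Warm up once (caches, JIT, GPU kernel compilation), then report
    the median of several timed runs to damp out noise."""
    fn()
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

print(f"groupby_sum: {bench(groupby_sum) * 1e3:.1f} ms")
```

Using the median rather than the mean, and always discarding the warm-up run, matters especially for GPU libraries, where the first call often pays one-time initialization costs.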

Video

Talk/Demo: Seq2seq Model on Time-series Data: Training and Serving with TensorFlow

Summary

Seq2seq models are a class of Deep Learning models that have recently provided state-of-the-art solutions to language problems. They also perform very well on numerical, time-series data, which is of particular interest in finance and IoT, among other domains. In this hands-on demo/code walkthrough, we explain model development and optimization with TensorFlow (its low-level API). We then serve the model with TensorFlow Serving and show how to write a client that communicates with TF Serving over the network and uses/plots the received predictions.
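Before the model itself, a seq2seq setup on time-series needs the series sliced into encoder inputs and decoder targets. As a minimal sketch (the window lengths and toy series here are illustrative, not from the talk), the input-preparation step looks like:

```python
def make_windows(series, enc_len, dec_len):
    """Slice a time series into (encoder_input, decoder_target) pairs
    for teacher-forced seq2seq training: the encoder sees enc_len past
    values, the decoder learns to emit the next dec_len values."""
    pairs = []
    for i in range(len(series) - enc_len - dec_len + 1):
        enc = series[i : i + enc_len]
        dec = series[i + enc_len : i + enc_len + dec_len]
        pairs.append((enc, dec))
    return pairs

series = [float(t) for t in range(10)]
pairs = make_windows(series, enc_len=4, dec_len=2)
print(pairs[0])   # ([0.0, 1.0, 2.0, 3.0], [4.0, 5.0])
print(len(pairs)) # 5
```

In TensorFlow these pairs would be batched and fed to the encoder and decoder; the sliding-window logic itself is framework-independent.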

Video

News Brief: 2013-2018 summary

Most projects on this site cover the pre-PhD period. The projects completed during the PhD are documented under PhD Thesis as well as under Publications. Projects after the PhD, i.e., 2013 onward, are largely undocumented due to confidentiality restrictions, as I have been working in industry. The projects I have worked on in this period, including recently, are in Big Data and Deep Learning. There are a few items that I worked on on the side and could publish:

  • Productionization of TensorSpark in yarn-cluster mode (tested on an HDP cluster): I contributed to the TensorSpark project, helping people run it in a YARN-based production environment. TensorSpark implements Downpour SGD, an approach proposed by Google. This asynchronous stochastic gradient descent (SGD) is intuitively better suited to cloud-based Spark clusters: the cluster's workers are typically sprinkled all over the data center, and you want to prevent a network bottleneck that affects a few workers from slowing down model training too much. See the GitHub issue/PR for details.
  • Class Activation Maps are a great tool for fine-tuning and better understanding Deep Learning models (ConvNets). I created a notebook to help with this. Tech setup: Jupyter notebook / Python / TensorFlow / VGG model / Caltech256 dataset

While I developed the above items in my own time, I subsequently used them in projects at the companies I was working for at the time.
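The core idea of Downpour-style asynchronous SGD mentioned above is that workers compute gradients against possibly stale parameters and push updates to a shared parameter server without global synchronization. A minimal single-machine sketch using threads and a toy 1-D linear model (a real TensorSpark run shards the data across Spark workers and hosts the parameters on a server process):

```python
import random
import threading

random.seed(1)

# Shared "parameter server": fit w in the toy model y = w * x.
params = {"w": 0.0}
lock = threading.Lock()
true_w = 3.0
data = [(x, true_w * x) for x in [random.uniform(-1, 1) for _ in range(400)]]

def worker(shard, lr=0.1, steps=200):
    for _ in range(steps):
        x, y = random.choice(shard)
        # Read a (possibly stale) copy of the parameters...
        w = params["w"]
        grad = 2 * (w * x - y) * x
        # ...compute a gradient locally, and push the update
        # asynchronously, without waiting for the other workers.
        with lock:
            params["w"] -= lr * grad

# Each worker trains on its own data shard, as Spark partitions would.
shards = [data[i::4] for i in range(4)]
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(round(params["w"], 2))  # converges toward 3.0
```

Because no worker waits on the others, one slow (network-bottlenecked) worker merely contributes fewer updates instead of stalling every training step, which is exactly the property that makes this scheme attractive on cloud clusters.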

Project: A Presence-based Messaging Application

Date Completed: December 2009

Here is the Report of this project.

As per the specification, a presence-based messaging and file-exchange application has been designed and implemented. Here is the scenario:

  • Clients connect to the server and immediately declare their presence.
  • A connected client can initiate a session by sending a request to the server along with the preferred number of clients for the session.
  • The server application checks the number of available clients for the session.
  • The server application initiates a session between the clients if the preferred number of clients is available.
  • When the session is underway, participants can exchange messages and files.
  • Only the session initiator can terminate the session.
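The session-matching logic in the scenario above can be sketched as a small in-memory class; the names and method signatures here are illustrative, not the project's actual API (the real application also handles sockets, messaging and file transfer):

```python
class SessionServer:
    """Minimal sketch of the presence and session-matching rules above."""

    def __init__(self):
        self.present = set()   # clients that have declared their presence
        self.sessions = {}     # initiator -> set of session participants

    def connect(self, client):
        # Clients declare their presence immediately upon connecting.
        self.present.add(client)

    def request_session(self, initiator, preferred_size):
        # Clients already in a session are not available for a new one.
        busy = {c for members in self.sessions.values() for c in members}
        available = [c for c in self.present - busy if c != initiator]
        if len(available) < preferred_size - 1:
            return None  # not enough available clients: no session started
        members = {initiator, *available[: preferred_size - 1]}
        self.sessions[initiator] = members
        return members

    def terminate(self, client):
        # Only the session initiator can terminate the session.
        if client in self.sessions:
            del self.sessions[client]
            return True
        return False

srv = SessionServer()
for c in ["alice", "bob", "carol"]:
    srv.connect(c)
print(sorted(srv.request_session("alice", 3)))  # ['alice', 'bob', 'carol']
print(srv.terminate("bob"))    # False: bob is not the initiator
print(srv.terminate("alice"))  # True
```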

Tech Report: Peer-to-Peer Traffic

Date Completed: August 2009

This Tech Report deals with Peer-to-Peer protocols. We start by giving a brief account of the history of P2P applications and then cite some of the P2P traffic measurement studies. P2P traffic identification methods and recent P2P traffic optimization schemes constitute the core of this report, in which we examine the state of the art in this field.

Tech Report: BitTorrent

Date Completed: July 2009

The BitTorrent protocol has emerged as the most popular P2P protocol over the past years. The core protocol was designed and implemented by Bram Cohen in 2001.

The protocol is especially useful for distributing large, popular files (like open-source operating system distributions), as its performance improves as the number of interested connected peers increases. The way BitTorrent operates lessens the burden (hardware costs and bandwidth resources) on servers hosting the files and distributes that burden among all the peers currently connected, significantly reducing costs for original content distributors as a result. Connected peers share the task of serving the content to newly connected peers, and a “tit-for-tat” mechanism ensures fairness among all the peers. This method of content sharing also improves redundancy in the overlay network (formed around that specific content), as a failure of the original content provider does not render the content unavailable. In this Tech Report, we explain the functionality of the BitTorrent protocol and its various system components.
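The tit-for-tat mechanism can be illustrated with a simplified choking sketch: a peer unchokes (i.e., serves) the peers that have uploaded to it fastest, plus one randomly chosen peer as an "optimistic unchoke" to discover new partners. Real BitTorrent clients also rotate unchokes on a timer and behave differently while seeding, so this is only a sketch of the core idea:

```python
import random

def choose_unchoked(upload_rates, k=3, optimistic=True, rng=random):
    """Tit-for-tat peer selection: unchoke the top-k uploaders, plus one
    random optimistic unchoke from the remaining (choked) peers."""
    ranked = sorted(upload_rates, key=upload_rates.get, reverse=True)
    unchoked = ranked[:k]          # reciprocate with the best uploaders
    if optimistic and ranked[k:]:
        unchoked.append(rng.choice(ranked[k:]))  # give a newcomer a chance
    return unchoked

# Hypothetical peers and their recent upload rates to us, in KB/s.
rates = {"p1": 50, "p2": 120, "p3": 80, "p4": 10, "p5": 95}
print(choose_unchoked(rates, k=3, optimistic=False))  # ['p2', 'p5', 'p3']
```

Ranking by what each peer has uploaded to us is what makes free-riding unattractive: a peer that contributes nothing rarely gets served, except through the occasional optimistic unchoke.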