Posts Tagged - data-science | Masood Khosroshahy (Krohy)

13 Mar, 2024

Designing an enterprise solution architecture for ML / AI / GenAI use cases

Introduction

Designing a fit-for-purpose solution architecture at an enterprise requires consideration of the following three elements:

Requirements specific to the new use case (ML / AI / GenAI)
The existing IT architecture landscape
New candidate platforms and infrastructure that could help fill the gaps

This short article highlights the importance of identifying the intersection of these three elements. The goal is a solution architecture that meets the needs of the new use case, reuses as many components of the existing IT architecture landscape as possible, and fills the remaining gaps with the tools, platforms or infrastructure that must be brought in to arrive at a viable, working solution.

12 Jun, 2019

Talk/Demo: Big Geospatial Data with Open-source Tech (Vectors, Rasters & Map-matching)

Summary

Geospatial datasets (i.e., geocoded data points) are everywhere nowadays and often add enormous value to data analytics/mining and machine learning projects. In this new era of Big Data, libraries and engines such as GeoPandas, PostGIS and their commercial equivalents often fall short: they cannot scale up sufficiently to let us tap into the Big Data that many organizations are collecting across many use cases. In this talk/demo, we explore free, open-source, Big Data-ready technologies and workflows such as GeoMesa, GeoPySpark and OSRM-on-Spark, and show how to use these Apache Spark-based tech/workflows for key geospatial operations and use cases. We start by introducing GeoMesa and demo-ing how it can be used to ingest Big Geospatial Data and perform operations on vectors. Next, we briefly introduce GeoPySpark, the Python interface to GeoTrellis, for performing operations on rasters. Finally, we turn to map-matching, which is the process of associating geocoded data points with named elements of an underlying network (e.g., determining which street a particular GPS point should be associated with). We describe and demo how OSRM can be combined with Spark to do scalable map-matching on Big Data, which opens up many possibilities for advanced data mining and machine learning projects.
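To make the map-matching idea concrete, here is a minimal sketch that assigns each point to the nearest street segment. It is an illustration only, not how OSRM matches points: the toy network `STREETS`, the planar coordinates and the function names are all assumptions made for this example, and a real matcher works on a routing graph with geographic coordinates and takes the whole GPS trace into account.

```python
from math import hypot

# Hypothetical toy network: street name -> (start, end) in planar x/y units.
STREETS = {
    "Main St": ((0.0, 0.0), (10.0, 0.0)),
    "Oak Ave": ((0.0, 5.0), (10.0, 5.0)),
    "1st St": ((5.0, -5.0), (5.0, 10.0)),
}


def point_segment_distance(p, a, b):
    """Return the Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b are the same point.
        return hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def match_point(p, streets=STREETS):
    """Return the name of the street whose segment is closest to p."""
    return min(streets, key=lambda name: point_segment_distance(p, *streets[name]))


def match_partition(points, streets=STREETS):
    """Match a batch of points; returns a list of (point, street name) pairs."""
    return [(p, match_point(p, streets)) for p in points]
```

Because each point is matched independently here, a batch function like `match_partition` is the kind of unit that could be applied per partition in a Spark job (for instance through `RDD.mapPartitions`); how the talk actually wires OSRM into Spark is not described in this summary.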

22 May, 2019

Talk/Demo: Large-scale Experimentation with Spark & Productionizing Native Spark ML Models

Summary

Apache Spark is the state-of-the-art distributed processing, analytics and ML engine. In this talk, we present and demo two interesting ways to use Spark in ML projects. 1) We use Spark to distribute the grid-search optimization of a generic ML model (from a regular, single-machine ML library). We show how Spark can distribute processing tasks over the CPU cores of a cluster, which gives a near-linear speedup and lowers processing times; this makes it feasible to explore a much larger search space when looking for the optimal hyperparameters of the ML model. This use case is suitable when a project does not involve Big Data and a Big Data technology, i.e., Spark, is used purely to speed up processing. 2) We demonstrate how to train an example model using Spark's own ML library and how to serve the model with MLeap, a production-quality, low-latency serving engine. This second use case/workflow is suitable when a project does involve Big Data.
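The first use case works because every hyperparameter combination can be evaluated independently of the others. The sketch below shows that pattern with a local thread pool from the Python standard library standing in for Spark executors; it is not the code from the talk. The `evaluate` function is a hypothetical placeholder for training and scoring one model, and the grid values are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for training and scoring one model.

    Returns (params, score), where a lower score is better.
    """
    learning_rate, max_depth = params
    score = (learning_rate - 0.1) ** 2 + (max_depth - 4) ** 2
    return params, score


def grid_search(grid, max_workers=4):
    """Evaluate every combination in the grid in parallel and return the best."""
    combos = list(product(*grid.values()))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each combination is an independent task, so the work can be split freely.
        results = list(pool.map(evaluate, combos))
    return min(results, key=lambda result: result[1])


# Illustrative grid; the parameter names and values are assumptions.
GRID = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 8]}
best_params, best_score = grid_search(GRID)
```

With Spark, the same shape is commonly expressed by parallelizing the list of combinations and mapping the evaluation function over it, so that the tasks run on the CPU cores of the cluster rather than on local threads.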

1 May, 2019

Talk/Demo: Supercharging Analytics with GPUs: OmniSci/cuDF vs Postgres/Pandas/PDAL

Summary

GPUs are known to significantly accelerate machine learning model training, especially when using deep learning libraries like TensorFlow. But did you know that there are now solid options for also accelerating data analytics workloads, BI tools and dashboards with the help of GPUs? Join us for a presentation of performance benchmarks of GPU-based options and their CPU-based counterparts. We compare the performance of OmniSci Core DB (a GPU database) with that of Postgres DB (for data analytics) and PDAL (for LiDAR processing). On the in-memory side, we benchmark cuDF (NVIDIA’s GPU DataFrame) against the widely popular Pandas DataFrame. We will share results and include some code walk-throughs and live benchmarking. Coming out of this technical talk, you will have insight into how GPUs can accelerate your data analytics and geospatial workloads.

10 Apr, 2019

Talk/Demo: Seq2seq Model on Time-series Data: Training and Serving with TensorFlow

Summary

Seq2seq models are a class of Deep Learning models that have recently provided state-of-the-art solutions to language problems. They also perform very well on numerical time-series data, which is of particular interest in finance and IoT, among other fields. In this hands-on demo/code walkthrough, we explain model development and optimization using TensorFlow's low-level API. We then serve the model with TensorFlow Serving and show how to write a client that communicates with TF Serving over the network and uses/plots the received predictions.
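As a rough sketch of the client side, the snippet below builds a prediction request for TensorFlow Serving's REST API using only the Python standard library. It assumes the REST endpoint is used (the talk's client may well use gRPC instead), and the host, port, model name `seq2seq` and the input shape of one window of time steps are illustrative assumptions, not details from the talk.

```python
import json
from urllib import request


def build_predict_request(window, host="localhost", port=8501, model="seq2seq"):
    """Build a POST request for TF Serving's REST predict endpoint.

    `window` is one input sequence, e.g. a list of [value] time steps; the
    shape the served model really expects depends on its signature.
    """
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    body = json.dumps({"instances": [window]}).encode("utf-8")
    return request.Request(url, data=body, headers={"Content-Type": "application/json"})


def parse_predictions(raw):
    """Extract the list of predictions from a TF Serving JSON response body."""
    return json.loads(raw)["predictions"]


def predict(window, **kwargs):
    """Send one window to a running TF Serving instance and return predictions."""
    req = build_predict_request(window, **kwargs)
    with request.urlopen(req, timeout=10) as resp:
        return parse_predictions(resp.read())
```

Calling `predict([[0.1], [0.2], [0.3]])` requires a TF Serving instance to be running at the assumed address; the returned list can then be handed to a plotting library.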