Patrick Herbert

Senior Machine Learning Engineer

Specializing in performance-optimized ML systems and novel architecture research, from custom TensorRT inference to training multimodal transformers and creating tensor decomposition algorithms. Combining published research with hands-on production engineering across speech model pipelines, agentic systems, and traditional ML.

Research Focus Areas

Multimodal Architectures
Optimization
Interpretability
High Performance Computing

Experience

October 2024 - Present
  • Real-Time Streaming Transcription: Designed and built a production transcription backend serving live calls over async gRPC with a custom TensorRT-compiled Parakeet TDT v3. Achieved a 5.6x end-to-end speedup over NeMo PyTorch (261 ms average latency per GPU at 25 concurrent stereo calls with full fusion) at ~9x lower cost than the enterprise transcription alternative. Scales horizontally via AWS ECS/ALB; the concurrent-serving pattern is sketched after this list
  • Batch Transcription & Domain Language Model: Replaced WhisperX with a customized Parakeet for offline call processing. Trained a 6-gram KenLM domain language model on curated call transcripts, applied via shallow fusion and GPU phrase boosting (scoring sketched after this list), reducing WER from 18.3% to 16.2%. Built a custom evaluation framework (content WER, rare-word WER, BERTScore, domain keyword accuracy). Achieved 91% AQA agreement with human QA graders, matching the enterprise model, at 73% faster transcription and 275x lower cost
  • Voice AI Agent: Identified agents spending significant time on routine support (1,000+ voicemails/2 weeks for 66 agents). Built AI voice agent from business case through production deployment. Reduced human-required support requests by 70% and first-word latency by 85% via ONNX/TensorRT optimization of the speech pipeline
  • Automated Quality Assurance: Designed serverless LLM-based compliance system (AWS Lambda) achieving 99% call coverage with structured YAML output for token efficiency. Built layered error taxonomy and confidence prediction model
  • Multi-Objective Optimization: Replaced an intractable discrete search (10¹⁹-configuration space) with differentiable optimization using a Plackett-Luce ranking model and PPO-style trust regions (see the sketch after this list), achieving a 3–5% improvement in top-3 plan sales with an 11x training speedup
  • Payment Prediction: Built ML model improving successful payment rates from 47% to 59%, directly impacting revenue through data-driven lead prioritization
  • RAG Infrastructure: Rewrote a traditional chunk-based RAG system into an agentic system over OCR-parsed markdown, achieving a 2x+ improvement on groundedness, accuracy, and multi-hop questions. Contributed to LightOn OCR model training direction for insurance document parsing. Optimized the OCR pipeline to ~45 seconds per 200-page document on an A100
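
A minimal sketch of the concurrent-serving pattern behind the streaming transcription bullet: an asyncio micro-batching worker that fans many live streams into single batched GPU calls. The gRPC service layer is omitted, and the engine API (`engine.transcribe`) is an assumed placeholder rather than the production interface.

```python
# Illustrative micro-batching pattern for serving many concurrent audio streams
# on one GPU. The engine call is a stand-in for the TensorRT-compiled model;
# all names and defaults here are hypothetical, not the production code.
import asyncio


class BatchedASRWorker:
    def __init__(self, engine, max_batch=25, max_wait_ms=20):
        self.engine = engine              # assumed: engine.transcribe(list[bytes]) -> list[str]
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def transcribe_chunk(self, pcm: bytes) -> str:
        """Called per stream: enqueue a chunk and await its transcript."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((pcm, fut))
        return await fut

    async def run(self):
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            # Collect more chunks until the batch fills or the wait budget is spent.
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            pcms, futs = zip(*batch)
            # One batched GPU call for all streams; run off the event loop so
            # the server keeps accepting chunks while the engine is busy.
            texts = await asyncio.to_thread(self.engine.transcribe, list(pcms))
            for fut, text in zip(futs, texts):
                fut.set_result(text)
```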
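
The shallow fusion in the batch-transcription bullet combines acoustic and language-model scores at decode time. Below is a simplified n-best rescoring version of the same idea using the real `kenlm` Python bindings; the model path and weights are illustrative, and production fusion runs inside beam search rather than as a rescoring pass.

```python
# N-best rescoring with shallow fusion against a KenLM n-gram model:
# score = log P_AM(y|x) + alpha * log P_LM(y) + beta * |y|.
# Path and weights below are placeholders, not the deployed values.
import math
import kenlm

lm = kenlm.Model("domain_6gram.arpa")  # hypothetical path to the trained 6-gram LM
LM_WEIGHT = 0.3    # alpha: LM interpolation weight, tuned on a dev set
WORD_BONUS = 0.5   # beta: length bonus countering the LM's short-hypothesis bias


def fused_score(am_logprob: float, hypothesis: str) -> float:
    """Combine acoustic-model and language-model log-probabilities."""
    lm_logprob = lm.score(hypothesis, bos=True, eos=True) * math.log(10)  # log10 -> ln
    return am_logprob + LM_WEIGHT * lm_logprob + WORD_BONUS * len(hypothesis.split())


def rescore(nbest: list[tuple[float, str]]) -> str:
    """Pick the hypothesis with the best fused score from an n-best list."""
    return max(nbest, key=lambda h: fused_score(*h))[1]
```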
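
The multi-objective optimization bullet relies on the Plackett-Luce model to make ranking likelihoods differentiable in per-item scores. Here is the textbook form of that log-likelihood plus a PPO-style clipped ratio as a trust region; this is a generic sketch, not the production objective, and `eps` is an assumed value.

```python
# Plackett-Luce ranking log-likelihood: turns "which plan ranks where" into a
# differentiable function of item scores, so gradient methods replace discrete search.
import torch


def plackett_luce_loglik(scores: torch.Tensor, ranking: torch.Tensor) -> torch.Tensor:
    """log P(ranking | scores) under Plackett-Luce.

    scores:  (n,) real-valued item scores (learned, differentiable)
    ranking: (n,) permutation of item indices, best first
    """
    s = scores[ranking]  # scores in ranked order
    # At position i, the chosen item competes against all not-yet-chosen items:
    # log-softmax of s[i] over the suffix s[i:], computed via a reversed cumsum.
    suffix_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return (s - suffix_lse).sum()


def clipped_objective(new_ll, old_ll, advantage, eps=0.2):
    """PPO-style trust region: clip the likelihood ratio so one update step
    cannot move the ranking distribution too far."""
    ratio = (new_ll - old_ll).exp()
    return torch.minimum(ratio * advantage, ratio.clamp(1 - eps, 1 + eps) * advantage)
```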
2024 - Present
  • Unified Dual-Stream Speech Encoder (WIP): Novel architecture jointly capturing semantics and prosody from a single FastConformer backbone for emotion-conditioned speech retrieval. Dual-stream split at ~70% encoder depth with a ColBERT-style multi-vector retrieval head (MaxSim scoring sketched after this list). Distilled from Qwen3-Omni and emotion2vec. Current speech emotion recognition unweighted accuracy (SER UA): 72.0% on IEMOCAP with no ASR degradation
  • Multi-Modal Auto-Dubbing Pipeline: End-to-end pipeline translating foreign-language video to dubbed English with voice cloning. Chains Mel-RoFormer (source separation), Microsoft VibeVoice (transcription + diarization), TranslateGemma 12B (translation), and Qwen3-TTS (voice cloning) with timing synchronization
  • BM-ALS Tensor Decomposition: Novel tensor decomposition algorithm in Bhattacharya-Mesner Hypermatrix Algebra Space with a JAX implementation, outperforming Tucker/CP by 2–30x across synthetic, physics (MuJoCo), and GPT-2 attention composition tensors; a CP-ALS baseline is sketched after this list for context. Under review
  • CoT Activation Scaffolding: Developed theory and ran behavioral experiments showing chain-of-thought tokens function as activation scaffolding rather than faithful reasoning traces, with mechanistic validation planned via Gemma Scope transcoders
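
The ColBERT-style head in the dual-stream encoder bullet scores query/document pairs by late interaction (MaxSim): each query vector picks its best-matching document vector and the maxima are summed. This is the generic scoring rule only; the actual head's dimensions and pooling are not shown.

```python
# ColBERT-style late-interaction ("MaxSim") scoring for multi-vector retrieval.
import torch
import torch.nn.functional as F


def maxsim_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """q: (nq, dim) query-side vectors, d: (nd, dim) document-side vectors."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    sim = q @ d.T                        # (nq, nd) cosine similarities
    return sim.max(dim=-1).values.sum()  # each query vector keeps its best match
```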
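
BM-ALS itself is under review, so no details here; for context, this is a minimal NumPy sweep of CP-ALS, the classical alternating-least-squares baseline family (CP/Tucker) it is benchmarked against.

```python
# Minimal CP-ALS for a 3-way tensor (illustrative baseline, not BM-ALS).
import numpy as np


def cp_als(T: np.ndarray, rank: int, iters: int = 50, seed: int = 0):
    """Rank-`rank` CP decomposition of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((n, rank)) for n in T.shape)
    # Mode-m unfolding: move axis m to the front, flatten the rest row-major.
    unfold = lambda X, mode: np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    # Column-wise Khatri-Rao product, outer index first to match the unfolding.
    khatri_rao = lambda U, V: (U[:, None, :] * V[None, :, :]).reshape(-1, rank)
    for _ in range(iters):
        # Each factor's least-squares update holds the other two fixed.
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C
```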
May 2021 - March 2026
  • Custom Transformer Architecture: Built a multimodal transformer from scratch for glaucoma progression prediction (clinical, OCT imaging, and visual field data). Handles irregular temporal sampling and variable-length sequences (a continuous-time encoding sketch follows this list). Improved AUC from 0.74 to 0.97 across iterations. 12 publications, including an ICML 2023 workshop presentation
  • Clinical NLP & Computer Vision: Fine-tuned BERT, Llama, and Qwen with LoRA/PEFT for clinical entity extraction (0.96 AUC); the standard LoRA setup is sketched after this list. Fine-tuned the SAM foundation model for automated retinal segmentation (Dice 0.88)
  • Research Data Infrastructure: Led infrastructure managing 10TB+ medical data, enabling multiple concurrent research projects and publications
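
One common way to handle the irregular visit intervals in the transformer bullet is to embed continuous elapsed time sinusoidally instead of using integer positions. This sketch shows the general technique with illustrative dimensions, not the model's actual encoding.

```python
# Sinusoidal encoding of continuous elapsed time (e.g., days since first visit),
# replacing integer positional indices for irregularly sampled sequences.
# Dimensions and scale are illustrative placeholders.
import torch


def time_encoding(t_days: torch.Tensor, dim: int = 64, max_scale: float = 10_000.0):
    """t_days: (batch, seq) real-valued elapsed times -> (batch, seq, dim)."""
    half = dim // 2
    # Geometric frequency ladder, as in standard transformer position encodings.
    freqs = torch.exp(-torch.arange(half) * torch.log(torch.tensor(max_scale)) / half)
    angles = t_days[..., None] * freqs   # (batch, seq, half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```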
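
The LoRA/PEFT fine-tuning in the clinical NLP bullet follows the standard Hugging Face PEFT pattern. A typical setup is below, with placeholder model and hyperparameters rather than the values used in the studies.

```python
# Typical LoRA fine-tuning setup with Hugging Face PEFT. Model name and
# hyperparameters are illustrative, not the studies' actual configuration.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                               # low-rank update dimension
    lora_alpha=32,                      # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections to adapt
)
model = get_peft_model(base, config)
model.print_trainable_parameters()      # only the LoRA adapters (and head) train
```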
May 2019 - December 2021
  • Quantitative Finance ML: Led team developing neural networks for S&P 500 return forecasting, identifying key predictive features from 94 stock characteristics
  • Neural Network Optimization: Developed mathematical formulations for network analysis, discovering that pruned networks can achieve higher adversarial robustness while maintaining accuracy; the experimental shape is sketched below
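
The pruning/robustness finding suggests an experiment of roughly this shape: magnitude-prune a trained network, then compare accuracy under a one-step FGSM attack. A generic sketch using `torch.nn.utils.prune`; the original work used its own formulations and models.

```python
# Magnitude pruning plus a one-step FGSM attack, the two ingredients of a
# pruning-vs-adversarial-robustness comparison. Generic and illustrative.
import torch
import torch.nn.utils.prune as prune


def l1_prune(model: torch.nn.Module, amount: float = 0.5):
    """Zero the smallest-magnitude weights in every linear/conv layer."""
    for module in model.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            prune.l1_unstructured(module, name="weight", amount=amount)


def fgsm(model, x, y, eps=0.03):
    """One-step FGSM adversarial example: perturb along the loss gradient's sign."""
    x = x.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```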

Selected Publications

Detecting glaucoma worsening using optical coherence tomography derived visual field estimates

Scientific Reports (2025) • Pham, A. T., Bradley, C., Hou, K., Herbert, P., & Yohannan, J.

The Impact of Achieving Target Intraocular Pressure on Glaucomatous Retinal Nerve Fiber Layer Thinning in a Treated Clinical Population

American Journal of Ophthalmology (2024) • Pham, A. T., Bradley, C., Hou, K., Herbert, P., Boland, M. V., Ramulu, P. Y., & Yohannan, J.

Deep learning-based identification of eyes at risk for glaucoma surgery

Scientific Reports (2024) • Wang, R., Bradley, C., Herbert, P., Hou, K., Ramulu, P., Breininger, K., Unberath, M., & Yohannan, J.

Take a Closer Look

My background combines academic rigor with production engineering. You can read the full CV or ask the AI assistant about specific implementation details.