---
title: "The malaria vector selection atlas"
shorttitle: "The malaria vector selection atlas"
slug: selection-atlas
date: 07/21/2025
thumbnail: '/thumbnails/atlas.png'
tag: malariagen_data, genetics
canonicalUrl: 'https://sanjaycnagi.com/blog/2025-07-21-selection-atlas/'
---

[![selection atlas signals](/blog/signals.png)](https://doi.org/10.1101/2025.07.16.664900)

Dual-active-ingredient insecticide-treated nets (ITNs) are now being deployed in sub-Saharan Africa. These nets are highly effective, even against pyrethroid-resistant malaria vectors, which should help in the fight against malaria. But given the evolutionary adaptability of malaria mosquitoes, we must rapidly find ways to detect resistance before it becomes widespread, in order to employ insecticide resistance management practices and protect the lifespan of these critical tools.

When a beneficial mutation arises — like one that helps a mosquito survive insecticide exposure — it begins to spread through a population, leaving a signature of reduced genetic diversity around it. We can exploit this signature to pinpoint regions of the genome that are under selection and threatening malaria vector control efforts. 
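
To make the idea of a "signature of reduced genetic diversity" concrete, here is a minimal sketch that computes average pairwise diversity in windows along a set of toy haplotypes. The haplotypes and the `windowed_diversity` helper are made up for illustration; this is not the statistic or the code used in the atlas, just a simple way to see the dip in diversity that a sweep leaves behind.

```python
from itertools import combinations


def windowed_diversity(haplotypes, window):
    """Mean pairwise differences per site, in non-overlapping windows."""
    n_sites = len(haplotypes[0])
    pairs = list(combinations(haplotypes, 2))
    diversity = []
    for start in range(0, n_sites, window):
        stop = min(start + window, n_sites)
        # Count the sites in this window at which each pair of haplotypes differs
        n_diffs = sum(
            sum(a[i] != b[i] for i in range(start, stop)) for a, b in pairs
        )
        diversity.append(n_diffs / (len(pairs) * (stop - start)))
    return diversity


# Toy data: 4 haplotypes x 6 biallelic sites (0 = reference, 1 = alternate).
# Every haplotype is identical across the middle two sites, mimicking a sweep.
haplotypes = [
    [0, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
]
diversity = windowed_diversity(haplotypes, window=2)
print(diversity)
```

The middle window has zero diversity while the flanking windows do not, which is the kind of local dip a genome scan looks for.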

In a new pre-print on [bioRxiv](https://www.biorxiv.org/content/10.1101/2025.07.16.664900), we present a web resource, the [**Malaria Vector Selection Atlas**](https://anopheles-genomic-surveillance.github.io/selection-atlas/), a database of selection signals in wild *Anopheles* mosquitoes within the [Vector Observatory](https://www.malariagen.net/vobs/). We analysed whole-genome sequence data from over 4,300 mosquitoes collected across 21 African countries, using population genomic methods to scan the genomes of our cohorts for telltale signs of recent positive selection. The atlas reveals both familiar loci and new threats: we confirm selection at established resistance genes like *Vgsc* (the target of DDT and pyrethroids) and at metabolic enzyme clusters like the *Cyp6p* locus, which encode enzymes that break down insecticides. But we also discovered novel signals, including a diacylglycerol kinase on the X chromosome that may represent a previously unknown resistance mechanism. The selection atlas is powered by an automated computational workflow that enables continuous updates as new data become available. As we integrate more recent data, we envisage that the resource will provide a crucial early warning system for tracking mosquito evolution in near real-time, helping to inform evidence-based decisions about where and how best to deploy our limited vector control tools.
 
---

*Explore the pre-print on [bioRxiv](https://www.biorxiv.org/content/10.1101/2025.07.16.664900v1) and the [selection-atlas web resource](https://anopheles-genomic-surveillance.github.io/selection-atlas/)*



---
title: "AnoKin: Using genomics to estimate dispersal in malaria mosquitoes"
shorttitle: "AnoKin: a close-kin mark-recapture study"
slug: anokin-ckmr-dispersal
date: 03/05/2025
thumbnail: '/thumbnails/dispersal.png'
tag: ckmr, dispersal
canonicalUrl: 'https://sanjaycnagi.com/blog/2025-03-05-ckmr-dispersal/'
---

![anokin logo](/blog/anokin-logo.png)

---

Despite more than a century of research, we still do not understand how far the malaria mosquito can fly.

This is a problem for many reasons. How can we design a cluster randomised controlled trial (cRCT) without knowing whether mosquitoes will fly from a control cluster into an intervention cluster? How can we model the spread of gene drives without knowing how far a single mosquito could transport the drive?

---

Last year, I was awarded some funding from the [Liverpool School of Tropical Medicine](http://lstmed.ac.uk/) to work on this problem, as part of their [Director's Catalyst Fund](https://www.lstmed.ac.uk/study/research-degrees/director%E2%80%99s-catalyst-fund-0). The idea was to use a genomic approach known as close-kin mark-recapture (CKMR). The method is analogous to traditional mark-release-recapture, except that rather than marking individuals and releasing them, we use kin to mark each other. We sequence mosquitoes, infer their kinship, and use the distances between kin to estimate how far they disperse. It was first developed in fisheries science and has many applications: it can be used to investigate dispersal, but also population size and survivorship.
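
As a rough illustration of the last step, the sketch below computes great-circle distances between the collection points of kin pairs and averages them. The trap names, coordinates and kin pairs are all invented for the example, and a real CKMR analysis would also need to account for the sampling design and kinship category. Treat it as a toy, not the AnoKin pipeline.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Hypothetical trap locations (latitude, longitude); not real AnoKin sites
traps = {
    "trap_A": (0.050, 34.150),
    "trap_B": (0.050, 34.160),
    "trap_C": (0.060, 34.150),
}

# Hypothetical kin pairs, recorded as the traps where each relative was caught
kin_pairs = [("trap_A", "trap_A"), ("trap_A", "trap_B"), ("trap_A", "trap_C")]

distances = [haversine_km(*traps[a], *traps[b]) for a, b in kin_pairs]
mean_distance = sum(distances) / len(distances)
print(f"mean distance between kin: {mean_distance:.2f} km")
```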

For the last couple of weeks, I have been based in Kisumu, western Kenya, working with Dr Eric Ochomo's team and Brian Polo of the [Kenya Medical Research Institute (KEMRI)](https://www.kemri.go.ke/) to set up the sampling. We are working at Lake Kanyaboli, a site of high malaria transmission that provides a stable habitat for the malaria mosquito *Anopheles funestus* throughout the year. Everything is going super well so far - I am really fortunate to be working with an awesome team!

#### The AnoKin site

In the spirit of transparency, part of my proposal was to set up a [public-facing website for AnoKin](https://sanjaynagi.github.io/anokin/). The site contains some background to the project, some protocols, and important links for the field team.

In ultimate nerdiness 😎, I've set up a [daily GitHub Actions workflow](https://github.com/sanjaynagi/anokin/blob/main/.github/workflows/docs-auto.yml) which downloads the sampling and morphological ID data from the ODK server where we store data. It then runs a Jupyter notebook to perform some analysis, and rebuilds and republishes the [webpage](https://sanjaynagi.github.io/anokin/).

So far, the sampling is going really well. Numbers are relatively low for Lake Kanyaboli, but this may actually be a positive - sampled mosquitoes are probably more likely to be related. Follow along here and on the AnoKin website, where I'll post regular updates as our project attempts to yield new insights into mosquito dispersal.


---
title: Neural networks for taxon prediction
shorttitle: Neural networks for taxon prediction
slug: neural-nets-taxon
date: 02/19/2025
thumbnail: '/thumbnails/phylo.png'
tag: deep-learning
canonicalUrl: 'https://sanjaycnagi.com/blog/2025-02-19-neural-networks-for-taxon-prediction/'
---

<center>
  <figure>
    <img
      src="/blog/tree-river.png"
      alt="phylogenetic tree and river system"
      width="600"
      height="600"
    />
    <figcaption>
      Figure 1. Some trees.
    </figcaption>
  </figure>
</center>


These two trees are topologically identical. One of them is a river system creating some of the most fertile land on earth, supporting a population of over 300 million people. The other is a phylogenetic tree of the *An. gambiae* complex. One of them is also tattooed on my arm. Extra points if you can guess the rivers.

#### Anopheles gambiae; a *complex* world

The major malaria mosquito, *Anopheles gambiae*, is a species complex of at least seven recognised species and at least four additional cryptic taxa. These taxa have varying contributions to malaria transmission, with *An. gambiae s.s.* and *An. coluzzii* considered the most important vectors due to their anthropophilic nature, or in other words, their preference for human blood. It is estimated that the ancestor of *An. gambiae* and *An. coluzzii* diverged from the rest of the complex [approximately 2 million years ago](https://doi.org/10.1126/science.1258524).

<center>
  <figure>
    <img
      src="/blog/range.png"
      alt="Range of species in the gambiae complex"
      width="350"
      height="350"
    />
    <figcaption>
      Figure 2. Range of species in the gambiae complex (<a href="https://doi.org/10.1126/science.1258524">Fontaine et al., 2015</a>).
    </figcaption>
  </figure>
</center>


*An. arabiensis*, on the other hand, shows more flexible host preferences but remains an important vector, particularly in drier areas, where it tends to dominate. Meanwhile, *An. merus* and *An. melas* are salt-tolerant coastal species and minor vectors. *An. quadriannulatus* is zoophilic (it prefers to feed on other animals) and is thought to play virtually no role in malaria transmission despite being competent to carry the parasite.

Despite these important phenotypic differences, these species are morphologically identical - you can't tell them apart even under a microscope. Traditionally, researchers have relied on molecular methods to identify members of the *An. gambiae* complex. The gold standard has been PCR-based assays targeting the ribosomal DNA intergenic spacer regions (IGS), developed by [Scott et al. in 1993](https://doi.org/10.4269/ajtmh.1993.49.520), and later the SINE200 method [(Santolamazza et al., 2008)](https://doi.org/10.1186/1475-2875-7-163). Importantly, both methods rely on single genetic loci, which means they can miss hybridization between species. They also cannot detect cryptic taxa, and in those cases provide incorrect species assignments.

#### Neural networks for taxon prediction

I've long been intrigued by the potential of machine learning to solve biological problems. For our recent development of a [platform for targeted genomic surveillance of vectors](https://doi.org/10.1101/2025.02.14.637727), we needed to be able to identify species within the gambiae complex. The amplicon panel itself, originally designed long before my entry into the field by collaborators at the Wellcome Sanger Institute, was built for discriminating between *An. gambiae* and *An. coluzzii*, but not necessarily other species. I had been trying out some traditional machine learning methods, but with large language models (LLMs) demonstrating remarkable capabilities in so many fields, I thought it was finally time to code up a neural network of my own.

Neural networks are computing systems inspired by the human brain's architecture. They consist of interconnected nodes ("neurons") organized in layers - an input layer that receives data, one or more hidden layers that process the information, and an output layer that produces the final result. Each connection between neurons has an associated weight, determining how much influence one neuron has on another. The network "learns" by adjusting these weights through a process called backpropagation, guided by a loss function that measures how far predictions deviate from known values.
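
To show those moving parts in one place, here is a minimal sketch of a one-hidden-layer network written in plain Python, trained by backpropagation on a toy dataset. Everything in it is an assumption for illustration: the data (alternate-allele counts at two imaginary markers), the layer sizes, and the `train_tiny_network` helper. It is not the Keras model from the notebook.

```python
import math
import random


def train_tiny_network(X, y, n_hidden=3, lr=0.1, epochs=2000, seed=0):
    """Train a one-hidden-layer binary classifier with full-batch gradient descent."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # Weights and biases: input -> hidden (w1, b1) and hidden -> output (w2, b2)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0

    def forward(x):
        # Hidden layer uses tanh; output layer uses a sigmoid
        h = [math.tanh(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
             for j in range(n_hidden)]
        z = sum(w * hj for w, hj in zip(w2, h)) + b2
        return h, 1.0 / (1.0 + math.exp(-z))

    losses = []
    n = len(X)
    for _ in range(epochs):
        gw1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1 = [0.0] * n_hidden
        gw2 = [0.0] * n_hidden
        gb2 = 0.0
        loss = 0.0
        for x, t in zip(X, y):
            h, p = forward(x)
            p_safe = min(max(p, 1e-12), 1 - 1e-12)
            # Binary cross-entropy: how far the prediction is from the known label
            loss -= t * math.log(p_safe) + (1 - t) * math.log(1 - p_safe)
            # Backpropagation: push the output error back through each layer
            dz = p - t
            gb2 += dz
            for j in range(n_hidden):
                gw2[j] += dz * h[j]
                dh = dz * w2[j] * (1 - h[j] ** 2)
                gb1[j] += dh
                for i in range(n_in):
                    gw1[j][i] += dh * x[i]
        # Gradient descent step on every weight and bias
        b2 -= lr * gb2 / n
        for j in range(n_hidden):
            w2[j] -= lr * gw2[j] / n
            b1[j] -= lr * gb1[j] / n
            for i in range(n_in):
                w1[j][i] -= lr * gw1[j][i] / n
        losses.append(loss / n)

    def predict(x):
        return forward(x)[1]

    return predict, losses


# Toy genotypes at two made-up markers (0, 1 or 2 alternate alleles) and taxon labels
X = [[0, 0], [0, 1], [1, 0], [2, 2], [2, 1], [1, 2]]
y = [0, 0, 0, 1, 1, 1]
predict, losses = train_tiny_network(X, y)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In practice a library like Keras handles the gradients for you, but the loop above is essentially what it does under the hood.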

Using free GPUs provided in Google Colab, alongside the TensorFlow and Keras libraries, I trained the network on ancestry informative markers within whole-genome sequence data from the Anopheles gambiae 1000 Genomes Project (Ag1000G). The neural network performed exceptionally well, achieving 100% accuracy in classifying all five species in the complex with sufficient numbers of samples: *An. gambiae*, *An. coluzzii*, *An. arabiensis*, *An. melas*, and the recently described Bissau molecular form. You can find the notebook for deep learning taxon prediction [here](https://colab.research.google.com/drive/1rcanKIJyD5Pnzg_17HCBUuVeIWe5QA5j?usp=sharing).

#### Four SNPs to Rule Them All, and in the Darkness, Predict Taxa

So, pretty amazing that a neural network can classify 5 species with 100% accuracy. Or is it? Well, it turns out that neural nets are kind of overkill for this problem. In fact, what I found in the targeted surveillance paper was that a simple decision tree using only four ancestry informative markers (AIMs) could predict the same taxa with very high accuracy.

The decision tree (Figure 3) showed remarkable accuracy for most species: *An. gambiae* (F1 = 0.995), *An. coluzzii* (F1 = 0.995), *An. arabiensis* (F1 = 1.000), and *An. melas* (F1 = 1.000). Although these SNPs were selected to distinguish between *An. gambiae* and *An. coluzzii*, it seems that they have become fixed in different combinations in the various species, allowing us to use these combinations to predict taxa. The tree performed slightly less well for the Bissau form (F1 = 0.849), likely because this cryptic taxon harbors less consistent AIM genotypes compared to the other species.

![decision tree](/blog/tree.svg)
Figure 3. A simplified representation of a decision tree trained on four AIMs to predict species in the Anopheles gambiae complex
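
For anyone unfamiliar with the F1 scores quoted above, they are the harmonic mean of precision and recall, computed separately for each taxon. Here is a small sketch of that calculation. The labels below are invented purely to exercise the function and have nothing to do with the real results.

```python
def f1_per_class(y_true, y_pred):
    """Per-class F1 score: the harmonic mean of precision and recall."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


# Made-up true and predicted taxa for six samples
y_true = ["gambiae", "gambiae", "coluzzii", "coluzzii", "bissau", "bissau"]
y_pred = ["gambiae", "gambiae", "coluzzii", "gambiae", "bissau", "coluzzii"]
scores = f1_per_class(y_true, y_pred)
print(scores)
```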

--- 

And back to my opening question - the right image depicts the Five Rivers of Punjab; the Jhelum, Chenab, Ravi, Beas, and Sutlej, tributaries of the Indus (admittedly, not in their correct orientation). The word Punjab literally means 'land of the five (panj) waters (ab)'. The similar topology between these two systems - one representing my scientific work, and the other an ancestral home - was an intriguing reminder of the unexpected connections we can find in nature.


---
title: Claude AI as a language learning companion
shorttitle: Claude AI as a language learning companion
slug: claude-ai-hindi
date: 01/06/2025
thumbnail: '/thumbnails/hindi.png'
tag: llm
canonicalUrl: 'https://sanjaycnagi.com/blog/2025-01-06-llms-language-companion/'
---

![punjab](/blog/punjab.jpeg)

As I’m writing this, I'm sat in the back of my cousin's car, driving through the chaotic, cacophonous streets of Ludhiana, a large, sprawling city in the heart of the Indian state of Punjab. I am midway through a trip to visit family here, a trip that I tend to make every year or two. My father moved to England from Punjab in the early 1980s. It was only ever intended to be a temporary visit, but he has remained there ever since, building a life, career and family, far away from home. Whilst we visited India when we were young, my siblings and I never had the chance to learn our father's two native tongues, Punjabi and Hindi. It was something I always felt compelled to address.

Since submitting my [PhD thesis](https://archive.lstmed.ac.uk/23310/) two years ago, I've been learning Hindi, primarily through online tutoring on [Preply](https://preply.com/). Progress has been steady, despite the fact I'm terrible at doing homework set by my tutor (if you are reading, I’m sorry, Shivaani!). I've recently found a tool, however, which has revolutionised my learning - using a large language model, Claude, as a personal language tutor and learning companion. 

Claude is a family of large language models (LLMs) developed by Anthropic, many of which are currently world-leading. These AI models are neural networks trained on vast amounts of textual data, allowing them to understand the context and relationships between words. LLMs are creating enormous interest for their potential to transform many aspects of daily life, such as assisting software developers in writing and debugging code more efficiently, automating routine tasks like email drafting and scheduling, or revolutionizing customer service through intelligent chatbots. In this post, I'll share how I've been using Claude to help me learn Hindi.

---

One of its biggest strengths lies in its ability to generate engaging learning materials. I first started using generative AI for Hindi with OpenAI’s ChatGPT, where I used it to generate stories which I could then translate for practice. It's fun - you can personalise them, and then read the story as you translate (my personal favourite was the story of Sanjay, a cricketer for India who moonlighted as a Bollywood actor, until one day, his secret was discovered 😅). This was with an earlier model (GPT-3.5), and the stories often contained many small mistakes which my tutor would need to correct before I started. Now, when I do this with Claude 3.5 Sonnet, the stories are essentially mistake-free. Beyond these stories, it helps to create personalised homework assignments and other translation exercises tailored to my learning needs.

I've found it incredibly useful for breaking down complex grammar structures. It can offer multiple perspectives until the concept resonates, and provides detailed explanations accompanied by practical examples. Each sentence can be analyzed in detail, showing both its literal meaning (like how 'mujhe chai peeni chahiye' directly translates to 'to-me tea drink should') and its natural English equivalent ('I should drink tea'). Typically, I’ve found it really challenging to rapidly access this information online.

Day to day, my family speak Punjabi to each other rather than Hindi - although they can understand me speaking Hindi, this can make it quite difficult for me to understand them. I’ve found Claude helpful for this, because I can ask it to give me the Punjabi version of a phrase in Hindi, with helpful explanations that go beyond Google Translate. It is also able to easily switch between scripts (for example, Devanagari and Latin script). And most importantly, unlike any normal tutor or learning buddy, Claude is available 24 hours a day, 7 days a week, and responds immediately. It responds thoughtfully, and never judges when responding, giving the user freedom to continue asking questions until understanding arrives. 

My intuition is that whilst LLMs are particularly useful at a beginner level (as I am), as you get more and more proficient in a language, their utility may decrease. And of course, there is only so much a text-based tutor can help with learning a language - nothing will replace real-time conversation with native speakers. The nuances of pronunciation, the rhythm of natural conversation, and the subtle cultural cues that emerge through face-to-face interaction remain essential components of achieving fluency.

While I'm still early in my Hindi learning journey, I'm excited to see how AI tools like Claude continue to evolve as learning companions. As an example of this evolution, Duolingo now have a new subscription tier, [Duolingo Max](https://blog.duolingo.com/duolingo-max/), powered by GPT-4. If you're interested in learning more about Claude and the future of LLMs, I'd recommend having a [play around](https://claude.ai/new), or listening to this episode of the Lex Fridman [podcast](https://open.spotify.com/episode/69V7CtdbB8blcxNPXvpnmk?si=AEsAvzaKQ3iZZp6qA0d8YA), with Dario Amodei (the CEO of Anthropic), Amanda Askell (a philosopher, responsible for shaping Claude's personality), and Chris Olah (co-founder at Anthropic, reverse engineering neural networks).



---
title: Karyotyping chromosomal inversions in malariagen-data
shorttitle: Karyotyping chromosomal inversions in malariagen-data
slug: karyotyping
date: 11/28/2024
thumbnail: '/thumbnails/chrom.png'
tag: malariagen_data
canonicalUrl: 'https://sanjaycnagi.com/blog/2024-11-28-karyotyping/'
---

Chromosomal inversions, where a region of a chromosome is flipped, play an important role in the evolution of malaria vectors, influencing their ability to exploit ecological niches and potentially affecting their ability to transmit parasites and tolerate insecticide exposure.

The Vector Observatory's [malariagen_data package](https://github.com/malariagen/malariagen-data-python) itself continues to evolve, providing researchers with powerful tools for accessing and analyzing genomic datasets from major malaria vectors. A recent addition to this toolkit is a karyotyping method that allows users to determine the karyotype of specimens in the Vector Observatory by genotyping inversion tagging SNPs. 

Here I plot karyotype frequencies of the 2La inversion in samples from phase 3 of the Ag1000G resource, including additional cohorts in VObs we've been working on at the Liverpool School of Tropical Medicine.

<HTMLPlot
  title="2La frequencies in Ag3"
  pathname="/blog/2La.html"
  height_before_scale={550}
  width_before_scale={600}
/>

You can see that the 2La karyotype (in green) is at higher frequencies as you move towards the more arid Sahel. The Colab notebook to generate this plot is [here](https://colab.research.google.com/drive/1qlhygvn1N9XFn5oQgWIoQOMIzMQkSGoE?usp=sharing).

#### Usage of karyotyping in malariagen_data

The functionality uses tagging SNPs from [compkaryo](https://doi.org/10.1534/g3.119.400445). Previously, the tool only reported the positions of the tagging SNPs, which would cause issues at sites which are multi-allelic - we would not know which alternate allele is the inversion tagging SNP. I've now found the exact alternate alleles, making the estimates more robust.
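
To illustrate why knowing the exact alternate allele matters, here is a simplified sketch of karyotyping from tagging SNPs. It counts, at each tag site, how many of a sample's two alleles match the inversion-tagging allele, then averages across sites. The function name, the allele-index encoding and the rounding step are my own assumptions for the example; this is not the malariagen_data implementation.

```python
def karyotype_from_tag_snps(genotypes, tag_alleles):
    """Estimate inversion copy number (0, 1 or 2) from tagging-SNP genotypes.

    genotypes: one (allele1, allele2) pair of allele indices per tagging SNP,
        where 0 is the reference allele and -1 marks a missing call.
    tag_alleles: the allele index that tags the inversion at each site.
    """
    counts = []
    for (a1, a2), tag in zip(genotypes, tag_alleles):
        if a1 < 0 or a2 < 0:
            continue  # skip missing calls
        # Only the specific tagging allele counts, not any non-reference allele
        counts.append(int(a1 == tag) + int(a2 == tag))
    if not counts:
        return None, None
    mean_count = sum(counts) / len(counts)
    return mean_count, round(mean_count)


# A hypothetical sample genotyped at four tag sites; the third site is
# multi-allelic and carries allele 2, which does not tag the inversion
genotypes = [(1, 1), (1, 1), (1, 2), (1, 1)]
tag_alleles = [1, 1, 1, 1]
mean_count, karyotype = karyotype_from_tag_snps(genotypes, tag_alleles)
print(mean_count, karyotype)
```

If we only knew the positions, the third site would be scored as carrying two tagging alleles, so matching on the exact allele avoids inflating the estimate at multi-allelic sites.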

Here's an example of how to use the new karyotyping function in malariagen_data:

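```python
import malariagen_data

# Set up access to the Ag3 data resource
ag3 = malariagen_data.Ag3()

df_karyo = ag3.karyotype(
    inversion="2La",  # The inversion to karyotype
    sample_sets="1244-VO-GH-YAWSON-VMF00149",  # The sample set
    sample_query=None,  # An optional query to filter samples
)

df_karyo.head()
```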

I hope this is a useful addition for users of malariagen_data! Any questions or feedback, please get in [touch](mailto:sanjay.c.nagi@gmail.com).

*Thank you to Anastasia Hernandez-Koutoucheva and Alistair Miles for the code to plot pie charts with Bokeh!* 



---
title: Using diplotype clustering to discover mutations causing insecticide resistance in malaria mosquitoes
shorttitle: Diplotype clustering for genomic surveillance of mosquitoes
slug: diplotype-clustering
date: 08/15/2024
thumbnail: /thumbnails/dendro.png
tag: genetics 
canonicalUrl: https://sanjaycnagi.com/blog/2024-08-15-diplotype-clustering/
---

Vectors of disease evolve rapidly in response to the interventions we throw at them. By monitoring genetic changes in these populations over time, we can detect emerging insecticide resistance mechanisms and monitor the spread of known mechanisms, with the aim of informing vector control strategies. In a world of limited active ingredients, this surveillance will be crucial for maintaining the effectiveness of front-line interventions like long-lasting insecticide-treated bed nets (LLINs) and indoor residual spraying (IRS).

To facilitate this, we have been developing systems and tools for genomic surveillance. The MalariaGEN Vector Observatory have developed the Python package [`malariagen_data`](https://malariagen.github.io/malariagen-data-python/latest/), which provides tools for accessing and analysing genomic datasets from major malaria vectors. By developing innovative software, we can help to build capacity in genomic research, allowing more people to perform robust, complex genomic analyses that would otherwise be limited to a select few.

In this blog post, we wanted to share a new function that we have recently added to `malariagen_data`: [plot_diplotype_clustering_advanced()](https://malariagen.github.io/malariagen-data-python/latest/generated/malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced.html#malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced). This method allows us to rapidly zoom in on a genome region of interest and identify selective sweeps, assess their size, detect potential gene flow events between countries or species, and investigate whether sweeps are driven by copy number variants (CNVs), amino acid mutations or both.

But what exactly are diplotypes, and why are they useful? A diplotype, sometimes referred to as a multi-locus genotype, is essentially the combination of two haplotypes from a single mosquito - one from each chromosome - at a particular genomic region. By analysing diplotypes rather than haplotypes, we can better capture the full genetic variation present in an individual, including complex structural variants like CNVs that can be difficult to phase onto haplotypes. Often, CNVs and multiallelic SNPs are ignored when analysing haplotype data. The more mosquitoes we sequence, the worse this problem gets - *An. gambiae s.l.* is so genetically diverse that, eventually, a significant proportion of all SNPs become multiallelic.

![diplotype](/blog/diplotype.png)
*Figure 1. Illustration of the relationship between diplotypes and haplotypes*
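
As a toy illustration of that relationship, the sketch below builds a diplotype from two haplotypes by summing alternate-allele counts at each site, then measures heterozygosity across the region. It assumes biallelic sites coded 0/1, and the helper names are hypothetical rather than part of `malariagen_data`.

```python
def diplotype(hap1, hap2):
    """Combine two haplotypes (0/1 per site) into alternate-allele counts (0/1/2)."""
    return [a + b for a, b in zip(hap1, hap2)]


def heterozygosity(dip):
    """Fraction of sites in the region where the individual is heterozygous."""
    return sum(1 for g in dip if g == 1) / len(dip)


# Two made-up haplotypes from the same mosquito across four sites
hap1 = [0, 1, 1, 0]
hap2 = [0, 1, 0, 0]

dip = diplotype(hap1, hap2)
het = heterozygosity(dip)
print(dip, het)
```

Note that the diplotype only needs unphased genotypes, which is why it can be computed even where phasing is unreliable.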

The new diplotype clustering functionality in `malariagen_data` performs hierarchical clustering on diplotypes from a specified genomic region. It then visualises the results, displaying:  

&nbsp;&nbsp;&nbsp; 1. The clustering dendrogram  
&nbsp;&nbsp;&nbsp; 2. Sample metadata (e.g., species, collection location)  
&nbsp;&nbsp;&nbsp; 3. Heterozygosity of each sample (within this genomic region)  
&nbsp;&nbsp;&nbsp; 4. Copy number at genes of interest  
&nbsp;&nbsp;&nbsp; 5. Amino acid variants in a specified transcript  
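
The clustering step itself can be sketched in a few lines. The example below uses city-block distance between diplotypes and a naive complete-linkage merge. Both choices, the toy data and the function names are assumptions for illustration: a real analysis would use an optimised library routine, and this is not the code inside `malariagen_data`.

```python
def cityblock(a, b):
    """City-block (Manhattan) distance between two diplotypes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_linkage(diplotypes, n_clusters):
    """Naive agglomerative clustering using complete linkage."""
    clusters = [[i] for i in range(len(diplotypes))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance
                d = max(
                    cityblock(diplotypes[a], diplotypes[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy diplotypes (alternate-allele counts at four sites) for four samples;
# samples 0 and 1 are identical, as expected for a swept haplotype
diplotypes = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
]
clusters = complete_linkage(diplotypes, n_clusters=2)
print(clusters)
```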

#### A case study

To illustrate the power of this approach, let's look at a case study of the *Gste2* gene from some recent whole-genome data of *An. gambiae s.l.* from Obuasi, central Ghana (Figure 2). *Anopheles* mosquitoes from this area are highly resistant to multiple classes of insecticides [[1](https://bmcinfectdis.biomedcentral.com/articles/10.1186/s12879-022-07795-4)]. The *Gste2* gene is known to be involved in resistance to DDT (and potentially other insecticides), through either copy number variation [[2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6673711/)], amino acid mutations [[3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3968025/)], or both. *Gste2*-I114T and *Gste2*-L119V are the major amino acid mutations at this locus known to confer resistance.

Here is an example of how to use the [plot_diplotype_clustering_advanced()](https://malariagen.github.io/malariagen-data-python/latest/generated/malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced.html#malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced) function:  


```python
import malariagen_data

# Set up access to the Ag3 data resource
ag3 = malariagen_data.Ag3()

ag3.plot_diplotype_clustering_advanced(
    region="3R:28,597,000-28,600,000",         # The genomic region for clustering
    cnv_region="3R:28,594,000-28,605,000",     # The genomic region for CNV data
    snp_transcript="AGAP009194-RA",            # The transcript for amino acid variants
    sample_sets="1244-VO-GH-YAWSON-VMF00149",  # The sample set
    sample_query=None,                         # A query to filter samples
    site_mask="gamb_colu",                     # The site mask to use
    linkage_method="complete",                 # The linkage method to use
    color="taxon",                             # The metadata column to determine color
)
```

<br></br>
<br></br>
<br></br>
<br></br>

There are many more optional parameters for the user to configure - see the [API docs](https://malariagen.github.io/malariagen-data-python/latest/generated/malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced.html#malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced) for more information. Here is the figure it produces:


![dipclust](/blog/dipclust-gste2.png)
*Figure 2. Diplotype clustering at Gste2.*

We've annotated this figure with some diplotype clusters which are particularly interesting. For example, cluster B contains a large number of diplotypes which are all genetically identical and have very low heterozygosity - this is what you expect when a selective sweep has occurred, and you find many individuals that are homozygous for the haplotype under selection. All individuals in cluster B also carry the *Gste2*-I114T substitution. If we did not already know something about this mutation, this figure would give us a clue that the mutation is a potential driver of selection and insecticide resistance. In fact, we already know this mutation causes insecticide resistance, so it is no surprise to find it linked to a selective sweep in this dataset.

We can also see another large cluster (cluster C) which does not harbour either I114T or L119V, but instead, the *Gste2*-F120L mutation. This cluster is homozygous for F120L and shows low heterozygosity, again indicative of diplotypes which have two copies of the same swept haplotype. Two things make *Gste2*-F120L a convincing candidate driver of resistance. Firstly, it is in very close physical proximity to known resistance mutations in codons 114 and 119. According to Riveron et al., codon 120 is located at the active site of the enzyme and is therefore likely to interact with the insecticide [[4](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r27)]. Secondly, there are no CNVs associated with this sweep, and no other amino acid variants except N3K, which is less likely to be causative due to its physical location away from the active site.

We also observe a small cluster of individuals (cluster A) which harbour a copy number variant (CNV) spanning *Gste2, Gste1, Gste3* and *Gste7*. CNVs could be driving insecticide resistance by increasing the expression of the genes they encompass, allowing the mosquito to detoxify more of the insecticide as a result. This CNV, an amplification, seems to exist at a variable copy number. In total, we can see six or seven distinct CNVs in these samples.

This case study demonstrates how diplotype clustering can provide insights into the mutations causing insecticide resistance; in a single snapshot, we can explore amino acid and CNV data and really understand the nature of selection at a genomic region. 

We hope that others will find it useful for their own research. If you use diplotype clustering in your own research, please cite [this paper where the method was first published](https://doi.org/10.1093/molbev/msae140). Please feel free to [get in touch](mailto:sanjay.c.nagi@gmail.com?subject=diplotype-clustering) if you have any questions or feedback :)

[Sanjay C Nagi](https://www.sanjaycnagi.com/) & [Alistair Miles](https://alimanfoo.github.io/)



---
title: "Book Club"
shorttitle: "Book Club"
slug: "book-club"
date: "05/14/2024"
thumbnail: '/thumbnails/books.png'
tag: books
canonicalUrl: 'https://sanjaycnagi.com/blog/2023-07-14-book-club/'
---

![library](/blog/library.png)
*My kind of library*

I recently got back into reading after a few months' hiatus, and realised this blog would be an ideal place to keep track of what books I've been reading, and share any recommendations with the world (should anyone actually be reading this!).  

### Books

**Homegoing by Yaa Gyasi**  
A very tragic but amazing book. ⭐⭐⭐⭐⭐

**The Psychology of Money by Morgan Housel**  
Must read. ⭐⭐⭐⭐⭐ 

**Monkey King by Wu Cheng'en**  
So fun. Loved it! ⭐⭐⭐⭐⭐

**English Pastoral: An Inheritance by James Rebanks**  
⭐⭐⭐⭐⭐

**Atomic Habits by James Clear**  
I started reading this after a stag do, just when I was contemplating all the things that I should really be doing in life. The book argues that real change comes from the compound effect of hundreds of small decisions which over time create a new identity - initially, it might be doing two push-ups a day, or waking up five minutes early. Clear calls these small changes "atomic habits." The book draws from proven ideas in biology, psychology, and neuroscience to create a guide for making good habits easier and bad habits more difficult. ⭐⭐⭐⭐

**The Lean Startup by Eric Ries**  
Steve Jobs led me to The Lean Startup, a book I'd been meaning to read for many years. The book is based on applying the tenets of lean manufacturing to startups. Lean manufacturing is a process developed in Japan at Toyota, which aims to eliminate waste to increase efficiency. The Lean Startup aims to shorten product development cycles and rapidly discover if a proposed business model is viable. This is achieved by adopting rapid scientific experimentation, early product releases, and what Eric Ries calls 'validated learning'. ⭐⭐⭐⭐

**Steve Jobs by Walter Isaacson**  
A fantastic read. Before this, I knew very little about Steve Jobs. He certainly was an odd fellow. So fascinating to hear about the beginnings of Apple; the excitement that Jobs and Wozniak must have felt in those early days is palpable. As someone who works in Tropical Medicine - a field heavily funded by the [BMGF](https://www.gatesfoundation.org/) - his relationship with Bill Gates is also particularly interesting. ⭐⭐⭐⭐⭐

**The Mountains Sing by Nguyễn Phan Quế Mai**  
A really wonderful book about the multigenerational saga of the Trần family. It depicts the struggles and triumphs of the Vietnamese people as they navigate the challenges of colonialism, communism, and the war with the United States. I realised when reading this, I knew so little about the Vietnam War, and the atrocities committed by the US government. There is so much loss and heartbreak in the story, and yet life and love endure on, in a really beautiful way.  ⭐⭐⭐⭐⭐

**The Odyssey by Homer, translation by Robert Fagles**  
After reading a few books about Greek mythology, including Stephen Fry's excellent Mythos and Heroes, I decided to read an actual classic, beginning with the Odyssey, one of Homer's two epic poems. I thoroughly enjoyed it, and was pleasantly surprised by its readability; I don't read a lot of poetry and the book is written in verse, but for the most part, it reads like prose. ⭐⭐⭐⭐

**The Unfolding of Language by Guy Deutscher**  
What an awesome book - this has been blowing my mind for the last couple of weeks. It's about the evolution of language, and the destructive and creative forces which cause it to change, such as economy, expressiveness, and analogy. I'm learning Hindi at the minute, and it's actually really helped me to understand why some of the things in English and Hindi are the way they are. ⭐⭐⭐⭐⭐

**Running with the Kenyans by Adharanand Finn**  
In 2023, I really got into running. At the time of writing, I'm also in Kenya, and after a recommendation from a friend, figured this could be a good shout. It was. Although the book purports to be about finding the 'secrets' to the exceptional feats of Kenyan runners, it really is just about the author's journey to Kenya with his family, and the wonderful people he meets there. In reality, there are no 'secrets'. And it's a really lovely read. Get me to Iten!! ⭐⭐⭐⭐

---

### A few favourites that pre-date the blog
**Nelson Mandela - Long Walk to Freedom**  
Everyone on earth should read this book! It's been some years since I read it, but I always remember it having a profound impact. The world would be a better place if we all had to read it.

**Richard Dawkins - The Selfish Gene**  
I owe a lot to this book. I read it during a formative period, in between my Bachelor's and joining LSTM to study for a Master's. It really opened my eyes to the wonderful world of evolutionary biology, and I've been hooked ever since. It also helped to awaken a thirst for knowledge which has remained with me.

**The Ramayana - Linda Egenes and Kumuda Reddy**  
This was the first (and only) version of the Ramayana I've read. The Ramayana literally means the Journey of Ram, and tells the story of Rama, the prince of Ayodhya, who wages a war against the demon king Ravana to rescue his wife, Sita. And it is this victory of light over evil for which we celebrate Diwali. It's a really beautiful book.



---
title: "My favourite polyfluorinated pyrethroid"
shorttitle: "My favourite polyfluorinated pyrethroid"
slug: "transfluthrin-resistance"
date: "09/07/2023"
thumbnail: '/thumbnails/tft.png'
tag: repellent 
canonicalUrl: 'https://sanjaycnagi.com/blog/2023-09-07-transfluthrin/'
---

![mosquito_shield](/blog/mosquito-shield.jpg)        
*Mosquito Shield™* - a novel transfluthrin-based spatial repellent product for the malaria vector control market (source: [SC Johnson](https://www.scjohnson.com/en/a-healthier-world/sc-johnson-combats-malaria))

---

A few words on the greatest polyfluorinated pyrethroid I’ve ever written a thesis about – Transfluthrin. Back in my first foray into vector control, I was working in Mumbai at [Godrej](https://godrej.com), looking at resistance to the main active ingredient in their flagship insect repellent brand, [GoodKnight™](https://www.goodknight.in/), for my MSc dissertation. With randomised-controlled trials (RCTs) of the SC Johnson product *Mosquito Shield™* ongoing in sub-Saharan Africa, now seems an opportune time to discuss some of those findings.

Transfluthrin is a vapour-phase pyrethroid often used in domestic household products, such as sprays and liquid vapourisers. It repels mosquitoes, as well as incapacitating them to prevent host-seeking and blood-feeding. It's been particularly popular in the South Asian market for several years, with South America seemingly playing catch up. And, in recent years, it's also been explored as a novel vector control tool for malaria.

#### Metabolic resistance to transfluthrin?

As well as giving it a high vapour pressure compared to common pyrethroids, the fluorination of transfluthrin may make it somewhat resistant to metabolic attack from cytochrome P450s. Typically, pyrethroids are metabolised at the 4' position of the phenoxybenzyl ring. In transfluthrin, however, the electronegative fluorines pull electrons away from its benzyl ring, in theory preventing attack by electron-hungry cytochrome P450s.

---
![tft_metabolism](/blog/tft_metabolism.png)  
A figure from my thesis, *'Mechanisms of resistance to transfluthrin in mosquitoes'*, 2017, supervised by Dr. David Weetman and Dr. Mark Paine.

---

Earlier research from Bayer [[1]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149738) had shown limited to no synergism with PBO in the FuMoz strain of *An. funestus*, suggesting that P450s could not metabolise transfluthrin. This was followed by a later study [[2]](https://www.sciencedirect.com/science/article/pii/S0048357523000214) showing that CYP6P9a/b could only metabolise transfluthrin very weakly, by targeting the gem-dimethyl group - as predicted in my thesis ;)

We demonstrated that this wasn't the case across species, however: PBO synergised volatile transfluthrin in an Indian strain of *Culex quinquefasciatus*, and the *An. gambiae* P450 CYP6P3 metabolised transfluthrin *in vitro*. *In vitro* metabolism was much lower than for deltamethrin, however, demonstrating transfluthrin's comparative ability to resist metabolic attack from P450s. The role that other gene families, such as carboxylesterases or chemosensory proteins, will play in transfluthrin resistance is not yet clear.

Given that PBO should still synergise transfluthrin in most resistant mosquito strains, the combination of PBO nets and a transfluthrin-based spatial repellent could be a useful combination for vector control. 


#### The effect of *VGSC* knockdown mutations on transfluthrin 

The "resistance-breaking" potential of transfluthrin doesn't end there. Although there is evidence that *Kdr* may reduce the sensitivity of mosquitoes to transfluthrin's repellent effects [[3]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4400042/) [[4]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8266078/), there are suggestions that *Kdr* mutations may not confer resistance to transfluthrin and other polyfluorinated pyrethroids to the same degree as typical pyrethroids. A study showed that *Kdr* mutations in *Aedes aegypti* lead to lower levels of resistance to transfluthrin than to other pyrethroids [[5]](https://www.sciencedirect.com/science/article/pii/S0048357513001478), whilst other research has shown that house fly *Super-Kdr* does not confer resistance to transfluthrin at all, potentially due to its shorter length [[6]](https://pubmed.ncbi.nlm.nih.gov/26691197/). 

It has even been hypothesised that vapour-phase pyrethroids may bypass cuticular resistance, via direct entry to the nervous system through insect spiracles [[7]](https://link.springer.com/article/10.1007/s13355-016-0443-2). Together, the above factors result in relatively low resistance ratios for transfluthrin when compared with standard pyrethroids [[8]](https://parasitesandvectors.biomedcentral.com/articles/10.1186/s13071-021-04997-8#Sec7), something we have also found with a range of pyrethroid-resistant mosquito strains at the School of Tropical Medicine (unpublished). It is important to note that resistance to transfluthrin is still likely to develop. 

#### Spatial repellent mixtures?

If spatial repellents are shown to be an effective tool for vector control, it will be important to raise discussions on how to maintain and increase the longevity of these products. Whilst writing this, I saw a study which found that, unlike most repellents such as DEET, transfluthrin does not activate olfactory neurons. Instead, its repellent properties are dependent on sodium channel activation [[9]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8266078/pdf/pntd.0009546.pdf). Interestingly, they also found that minuscule concentrations of transfluthrin synergise the effects of DEET and several other repellents! [[10]](https://www.sciencedirect.com/science/article/pii/S0048357523000524)

This finding suggests that using transfluthrin in a mixture with a non-pyrethroid vapour-phase repellent could be extremely effective, as well as extending the shelf-life of the products themselves. Another recent study identified repellent compounds which have greater activity than the gold-standard DEET and similar vapour pressures to transfluthrin [[11]](https://www.sciencedirect.com/science/article/pii/S0048357518303900), and which could be ideal within such a mixture.

---

In 2023, we find ourselves in desperate need of novel vector control tools. Let's pray that spatial repellents can play an important role in reducing the burden of malaria.



---
title: "Ultra user-friendly bioinformatics pipelines pt.1"
shorttitle: "Ultra user-friendly bioinformatics pipelines pt.1"
slug: "Ultra-user-friendly-bioinformatics-pipelines-pt1"
date: 06/25/2023
thumbnail: '/thumbnails/jb.png'
tag: snakemake
canonicalUrl: 'https://sanjaycnagi.com/blog/2023-06-27-user-friendly-pipelines-pt1/'
---

Workflow managers such as [Snakemake](https://snakemake.github.io/) and [Nextflow](https://www.nextflow.io/) are wonderful tools - they allow us to build complex pipelines to reproducibly analyse genomic data with relative ease. These workflows run command line tools or scripts, performing some processing and analysis on input data, and writing outputs, tables and figures to results directories for the user to explore. Interpreting these genomic analyses, however, can be challenging, particularly for those who are less familiar with computational biology. To compound that, bioinformatic pipelines rarely have sufficient documentation, if any at all.

In this post, I wanted to share an interesting approach I've been using to present the results of computational workflows :) 

**Results web-books with Papermill, Notebooks, and Jupyter Book**

The approach involves the combination of a few semi-recent developments - in particular - [Papermill](https://github.com/nteract/papermill) and [Jupyter Book](https://jupyterbook.org/en/stable/intro.html), combined with Jupyter Notebooks.  

<details>
    <summary><em><b>What is a Jupyter Notebook?</b></em></summary>
  
    A Jupyter Notebook is an interactive computing environment that allows you to create and share documents containing live code, visualizations, and explanatory text. For those familiar with R, it is similar to R Markdown. It provides a web-based interface where you can write and execute code, typically Python. Jupyter Notebooks enable data analysis, experimentation, and collaboration in a convenient and flexible manner.
</details>

Papermill is a tool which allows Jupyter Notebooks to be parameterised and run from the command line - when we run the notebook, we can pass through some parameters. Surprisingly, standard Jupyter Notebooks do not support this - they are intended to be run interactively, cell by cell. Papermill means we can use Jupyter Notebooks in workflows directly, like Python scripts, and store the executed notebooks as outputs.
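As a rough sketch of what this looks like in practice, the snippet below builds a Papermill command line using its `-p` flag, which passes a single named parameter into the notebook. The notebook paths and parameter names are hypothetical, chosen only for illustration, and the notebook is assumed to contain a cell tagged `parameters` so that Papermill knows where to inject the values.

```python
import subprocess


def papermill_command(notebook, output, params):
    """Build a papermill CLI call that injects `params` into `notebook`."""
    cmd = ["papermill", notebook, output]
    for name, value in params.items():
        # Each -p flag passes one named parameter to the notebook
        cmd += ["-p", name, str(value)]
    return cmd


def run_notebook(notebook, output, params):
    """Execute the notebook, keeping the executed copy at `output`."""
    subprocess.run(papermill_command(notebook, output, params), check=True)


# Hypothetical notebook and parameter names, for illustration only
cmd = papermill_command(
    "notebooks/coverage.ipynb",
    "results/notebooks/coverage.ipynb",
    {"metadata_path": "config/metadata.tsv", "min_depth": 10},
)
print(" ".join(cmd))
```

Inside a workflow, the same command would typically sit in the shell section of a rule, with the executed notebook declared as an output.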

This is useful for a few reasons... 

Many people develop and debug in a Jupyter Notebook, and so this approach removes the need to convert to and from python scripts, saving valuable developer time. It also means that if you would like to perform a specific part of the analysis, it's easy to pull out a single notebook and apply it to your data. 

But the coolest thing comes when you integrate Jupyter Book. Jupyter Book is an awesome tool which builds HTML web pages from a collection of Jupyter Notebooks, a table of contents, and a configuration file. It's now widely used for building software documentation, such as in [malariagen_data](https://malariagen.github.io/vector-data/ag3/api.html), or the [Jupyter Book docs](https://jupyterbook.org/en/stable/start/example-book.html) themselves! Importantly, the Jupyter Notebooks can contain executed code with tables and figures, as well as markdown text. This means we can include our results notebooks for each step of the analysis, with clear descriptions of what the analysis does, and how to interpret the results! Interactive plots can be generated using the powerful plotly and bokeh libraries, giving end-users the chance to dive deep into the data, all within the familiar realm of a web page.
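To give a flavour of the table of contents, here is a minimal sketch that writes one from a list of executed notebooks. It assumes the `jb-book` `_toc.yml` layout described in the stable Jupyter Book docs; the notebook names are made up, and this is not necessarily how AmpSeeker generates its own.

```python
def make_toc(root, notebooks):
    """Return the text of a `_toc.yml` with one chapter per notebook."""
    lines = ["format: jb-book", f"root: {root}", "chapters:"]
    # Paths are relative to the book directory, without the file extension
    lines += [f"  - file: {notebook}" for notebook in notebooks]
    return "\n".join(lines) + "\n"


# Hypothetical notebook names, for illustration only
toc = make_toc("intro", ["notebooks/quality-control", "notebooks/coverage"])
print(toc)
```

With the table of contents and a `_config.yml` in place, running `jupyter-book build` on the book directory should produce the HTML pages.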

My collaborators at UVRI, Trevor Mugoya and Edward Lukyamezi, and I have recently been exploring this idea in [AmpSeeker](https://github.com/sanjaynagi/AmpSeeker), a workflow we are developing for amplicon sequencing data. I must say, it feels really nice to be able to browse all the analyses in one place. An example of the results book is shown below. 

<figure>
    <div align="center">
    <iframe width="560" height="315" src="https://www.youtube.com/embed/mt-AZeYz50k" title="YouTube video player" frameBorder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowFullScreen></iframe>
    </div> 
    <figcaption><em>An example of the (draft) AmpSeeker results book. If you like Hindi music or Anime, I'd recommend checking out the user guide placeholder, or the film that it's from :) </em></figcaption>
</figure>

I hope that others might find this a useful approach to building workflows. Exciting as it is, it now means I have the task of converting [rna-seq-pop](https://github.com/sanjaynagi/rna-seq-pop) to this style of workflow! Wish me luck!

2025 update: An example of the AmpSeeker results book is now online [here](https://sanjaynagi.github.io/agvampir002-results/intro.html)



---
title: "Parantha reviews"
shorttitle: "Parantha reviews"
slug: "Paranthas-Are-The-Best"
date: "03/30/2023"
thumbnail: '/thumbnails/parantha.png'
tag: food
canonicalUrl: 'https://sanjaycnagi.dev/blog/2023-03-30-parantha-reviews/'
---

![parantha](/blog/parantha_mj.png)
*Parantha heaven.*

Hello, fellow paratha/parantha/parotta enthusiasts! I'm here to share my thoughts 
and reviews on some of the best paranthas I've had the pleasure of 
devouring. As a self-proclaimed parantha addict, I have travelled far 
and wide in search of the perfect parantha. And let me tell you, it's 
been a delicious journey.

**Reviews**

**Anands sweets, LS6 Leeds**  
Anands is my favourite place to visit any time I return to Leeds. It used to be an Indian sweet shop, but a few years back they turned it into a cafe, serving a whole range of vegetarian dishes. It's an amazing place which brings me joy just to sit in there, and the channa masala is the best I've ever tasted. The last time I went, however, the paranthas could have been better (I'm sorry, Anands!).  
Rating: 6.5/10

**Chaiwala, Bold Street L1 Liverpool**  
These Indian street food places seem to have popped up everywhere. I'm not complaining. Chai was delicious, and so was the Samosa pav I had. The parantha could have done with more flavour (I suppose this has to cater to the British palate), but the yoghurt and raita were strangely good. Enjoyable.  
Rating: 6.5/10

**Chai walay, LS6 Leeds**  
I actually found this place whilst trying to visit Anands, on the way back to Liverpool one morning. Anands was closed, so I stopped in my car and desperately googled 'Paranthas Leeds' and this place came up. I was excited. I remember it clearly, a really sunny winter's day. And it was good. A massive aloo Parantha, very flaky and very delicious. A bit too oily, though I suspect that may have aided the overall taste. Chai was delicious.  
Rating: 7.5/10

**Zaffran, L8 Liverpool**  
This is one of those ghost restaurants - it's just a kitchen featured on JustEat and Deliveroo, you can't sit inside. This poses a problem for me, as I like to eat the Parantha fresh out of the tandoor. The only option - drive right up to the entrance, and sit in my car to eat it. Greasy fingers everywhere, not ideal. The kulchas were nice, but the aloo parantha was fried (not cooked in a tandoor) and I'm fairly sure it came from a frozen packet.  
Rating: 6.1/10

**Paratha Hut, Levenshulme, Manchester**    
I've been meaning to go here for ages. One day I took a cheeky 45 min detour on the way back to Leeds. This is a little hut in the corner of a car wash. There are many options, and the (aloo) paranthas were quite delicious!  
Rating: 7.8/10

**Amritsari Naan stall, Banga, Punjab**  
Delicious and cheap, made in the tandoor. 3 minute walk from the house. Love it.  
Rating: 10/10



---
title: Parallelising freebayes with snakemake
shorttitle: Parallelising freebayes with snakemake
slug: parallelising-freebayes-with-snakemake
date: 11/01/2021
thumbnail: '/thumbnails/snakemake.png'
tag: genetics
canonicalUrl: 'https://sanjaycnagi.com/blog/2021-01-11-freebayes-parallel/'
---

[`freebayes`](https://github.com/freebayes/freebayes) is a Bayesian haplotype-based variant caller, used widely in genomics. As with many variant callers, it is not readily parallelised, but it can be run in parallel by splitting the genome into smaller chunks, calling them separately, and subsequently combining the chunks together.

A wrapper for freebayes, [`freebayes-parallel`](https://github.com/freebayes/freebayes/blob/master/scripts/freebayes-parallel), does exactly this, making use of `gnu-parallel`. However, this approach has a major limitation:

* When a chunk is completed, that CPU core will not move on to the next region until all cores have completed their respective chunks. This is particularly problematic in regions of variable coverage, and so one can attempt to split the genome into regions of roughly equal coverage. Unfortunately, this still results in many cores being unused for substantial periods of time.

I was implementing a `freebayes` variant calling step in a snakemake RNA-Sequencing pipeline I was writing (more on this later), and wanted to parallelise freebayes, without the above limitation. 

To do so, we can write a snakemake rule (below) which runs an [R script](https://github.com/sanjaynagi/rna-seq-ir/blob/master/workflow/scripts/GenerateFreebayesParams.R) to read in the genome index (.fai) file, and output multiple bed files, breaking the genome into chunks of equal size. By using an extra snakemake wildcard, the index of each genome chunk, we can produce a separate bed file for each chunk and supply it to freebayes. Finally, after concatenating the vcfs with `bcftools concat`, it is also important to stream the output through `vcfuniq`, to ensure there are no duplicate calls at the region overlaps. 

The benefit of this is that snakemake will automatically run the next job when each chunk is complete, reducing overall computation time as compared with `freebayes-parallel`. Importantly, it also allows us to perform joint, multi-sample calling, which is one of the main benefits of using freebayes in the first place. 


```python
import numpy as np

# Read in the desired number of genome chunks from the config.yaml, and arrange a sequence 1..n.
# np.arange excludes its stop value, hence the + 1.
chunks = np.arange(1, config['chunks'] + 1)

# `samples` (the list of sample names) is assumed to be defined earlier in the Snakefile.
# Note - in this case the script also produces some other files for Freebayes
rule GenerateFreebayesParams:
    input:
        ref = config['ref']['genome'],
        index = config['ref']['genome'] + ".fai",
        bams = expand("resources/alignments/{sample}.bam", sample=samples)
    output:
        bamlist = "resources/bam.list",
        pops = "resources/populations.tsv",
        # one bed file per chromosome and chunk
        regions = expand("resources/regions/genome.{chrom}.region.{i}.bed", chrom=config['chroms'], i=chunks)
    log:
        "logs/GenerateFreebayesParams.log"
    params:
        metadata = config['samples'],
        chroms = config['chroms'],
        chunks = config['chunks']
    conda:
        "../envs/diffsnps.yaml"
    script:
        "../scripts/GenerateFreebayesParams.R"
```
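For anyone who would rather not use R, the chunking step itself can be sketched in plain Python. This is only an illustration of the idea, not a port of the R script: the function names are my own, the intervals here are non-overlapping and half-open as in BED, and the output file names simply follow the pattern used in the rule above.

```python
import os


def read_fai(path):
    """Return {contig: length} from a .fai index (first two tab-separated columns)."""
    lengths = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def chunk_regions(length, n_chunks):
    """Split [0, length) into n_chunks contiguous (start, end) intervals of near-equal size."""
    bounds = [i * length // n_chunks for i in range(n_chunks + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def write_beds(lengths, chroms, n_chunks, outdir="resources/regions"):
    """Write one bed file per contig and chunk, with chunks numbered from 1."""
    os.makedirs(outdir, exist_ok=True)
    paths = []
    for chrom in chroms:
        for i, (start, end) in enumerate(chunk_regions(lengths[chrom], n_chunks), start=1):
            path = os.path.join(outdir, f"genome.{chrom}.region.{i}.bed")
            with open(path, "w") as out:
                out.write(f"{chrom}\t{start}\t{end}\n")
            paths.append(path)
    return paths


print(chunk_regions(10, 3))  # [(0, 3), (3, 6), (6, 10)]
```

Each of the resulting bed files can then be matched to one freebayes job through the chunk wildcard.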

*Update - 09/03/2021 - In the beautiful nature of open source software development, I have now written a parallelisation section in the [freebayes documentation](https://github.com/freebayes/freebayes) :)*

I'll leave you with a song I've been enjoying recently. Happy variant calling.

<iframe width="560" height="315" src="https://www.youtube.com/embed/1fBEEANitDY" frameBorder="0" allow="autoplay; encrypted-media" allowFullScreen></iframe>


