Design principles of nonspecific transcription factor-DNA binding in eukaryotic genomes


Transcription factors (TFs) are proteins that regulate gene expression in both prokaryotic (e.g. bacteria) and eukaryotic (e.g. yeast or fly) cells. TFs bind regulatory promoter regions of DNA in the genome. It is commonly accepted that each TF specifically binds a relatively small set of DNA sequences called TF binding motifs or TF binding sites (TFBSs). A TF binds its specific binding motifs with a higher affinity than other genomic sequences of the same length. The typical length of a TF binding motif ranges between 6 and 20 nucleotides. Recent high-throughput measurements of TF binding preferences on a genome-wide scale have challenged this classical picture of TF specificity. Such experiments measured the binding preferences of more than a hundred TFs to tens of thousands of DNA sequences and demonstrated a high level of multi-specificity in TF binding. It has also been pointed out that weak-affinity TF binding motifs are essential for the regulation of gene expression.


A key question is how TFs find their specific binding sites against a background of billions of non-specific sites in a cell's genome. This question was first addressed theoretically in the seminal works of Berg, Winter, and von Hippel. The central idea of this approach is that the search process combines three-dimensional and one-dimensional diffusion. Different theoretical models have shown that one-dimensional diffusion (termed 'sliding' or 'hopping', depending on the model) facilitates the search process under certain conditions.


Despite the success of these phenomenological models, a complete understanding of the search process is still lacking. In particular, a key open question is what makes a TF switch from three-dimensional diffusion to one-dimensional sliding at specific genomic locations. These models invariably assume the existence of non-specific binding sites that bring TFs to the vicinity of DNA for one-dimensional sliding. This assumption is a key component of all theoretical models, yet the molecular origin of this effect is not understood. Recent single-molecule experimental studies clearly show that different DNA-binding proteins spend the majority of their time non-specifically bound and diffusing along DNA. The question is: what biophysical mechanism provides such non-specific attraction towards genomic DNA and regulates its strength at a given genomic location?


We predict that DNA sequence correlations statistically regulate non-specific TF-DNA binding preferences. Depending on the symmetry and length-scale of the sequence correlations, the non-specific binding affinity can be either enhanced or reduced. In particular, we show that homo-oligonucleotide sequence correlations, in which nucleotides of the same type are clustered together (such as poly(dA:dT) and poly(dC:dG) tracts), generically reduce the non-specific TF-DNA binding free energy, thus enhancing the binding affinity. Sequence correlations in which nucleotides of different types alternate have the opposite effect, increasing the non-specific TF-DNA binding free energy. Correlation analysis of the regulatory sequences of the yeast genome suggests that the predicted design principle is exploited at the genome-wide level in order to increase the strength of non-specific binding at these regulatory genomic locations. We suggest, therefore, that in addition to all known signals, genomic DNA encodes its intrinsic propensity for non-specific binding to TFs.
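
The qualitative effect described above can be illustrated with a minimal toy calculation. This is a sketch under stated assumptions, not the calculation used in this work: each hypothetical TF contacts a window of m consecutive nucleotides, each nucleotide type contributes an independent Gaussian contact energy (in units of kT), the sequence is treated as circular, and the binding free energy is F = -ln(sum over all window positions of exp(-U)). The names free_energy and mean_free_energy, and all parameter values (m, sigma, n_tfs, sequence length), are illustrative choices.

```python
import math
import random

ALPHABET = "ACGT"


def free_energy(seq, energies, m):
    """Return F = -ln(sum_i exp(-U_i)) for one toy TF on a circular sequence.

    U_i is the sum of per-nucleotide contact energies over the window of
    length m starting at position i (energies in units of kT).
    """
    n = len(seq)
    us = [sum(energies[seq[(i + j) % n]] for j in range(m)) for i in range(n)]
    u_min = min(us)  # shift by the minimum for a stable log-sum-exp
    return u_min - math.log(sum(math.exp(-(u - u_min)) for u in us))


def mean_free_energy(seq, m=8, n_tfs=200, sigma=2.0, seed=0):
    """Average F over an ensemble of toy TFs with Gaussian contact energies.

    The same seed yields the same TF ensemble, so different sequences are
    compared against identical sets of contact energies.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tfs):
        energies = {a: rng.gauss(0.0, sigma) for a in ALPHABET}
        total += free_energy(seq, energies, m)
    return total / n_tfs


# Three sequences with identical nucleotide composition (100 of each type).
clustered = "".join(a * 25 for a in ALPHABET) * 4   # homo-oligonucleotide tracts
alternating = "ACGT" * 100                          # different types alternate
shuffled = "".join(random.Random(1).sample(clustered, len(clustered)))

f_clustered = mean_free_energy(clustered)
f_alternating = mean_free_energy(alternating)
f_shuffled = mean_free_energy(shuffled)

print("clustered  :", f_clustered)
print("shuffled   :", f_shuffled)
print("alternating:", f_alternating)
```

In this toy setting every window of the alternating sequence has the same composition, so all positions have the same energy and the free energy takes its largest possible value for the given composition. Clustering nucleotides of the same type broadens the spread of window energies and therefore lowers the free energy for each toy TF, in line with the trend stated above. The comparison between the clustered and the shuffled sequence is statistical and depends on the assumed parameters.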