The role of genomic sequence in directing the packaging of eukaryotic genomes into chromatin has been the subject of considerable recent debate. A new paper from Tillo and Hughes shows that the intrinsic thermodynamic preference of a given sequence in the yeast genome for the histone octamer can largely be captured with a simple model, and in fact is mostly explained by %GC. Thus, the rules for predicting nucleosome occupancy from genomic sequence are much less complicated than has been claimed.
Packaging of eukaryotic DNA into nucleosomes has profound effects on DNA-templated processes. The 147 bp of DNA wrapped around the histone octamer is generally believed to be less accessible to DNA-binding proteins than is the DNA between nucleosomes. The positioning of nucleosomes relative to underlying sequences therefore has considerable implications for the regulation of gene expression, and understanding where nucleosomes are located and the rules underlying nucleosome positioning are key questions in understanding transcriptional control.
The recent revolution in genomics technologies has made genome-wide mapping of nucleosome positions possible in organisms ranging from budding yeast to humans. These genome-wide maps provide us with a multitude of hypotheses regarding the role of nucleosome positioning in gene regulation (reviewed in [1-3]). Perhaps one of the biggest surprises from even the earliest of these mapping efforts (in Saccharomyces cerevisiae) was the observation that the majority of nucleosomes are 'well positioned', that is, that nucleosomes occupy the same position (in many cases, to within mapping precision) in the majority of cells in a mixed population in the mid-log phase of growth (that is, actively growing unsynchronized yeast). This surprised many investigators, not least because the a priori expectation for a general packaging-protein complex would include a lack of sequence specificity. Furthermore, yeast promoters turned out to look very similar to one another, with a nucleosome-depleted 'nucleosome-free region' (NFR) observed at the majority of yeast promoters. This unanticipated level of order then raises the question of what underlies the remarkably consistent chromatin packaging in cell populations. Work recently published in BMC Bioinformatics by Tillo and Hughes provides one surprisingly simple answer to this question, suggesting that the rules for predicting nucleosome occupancy from genomic sequence may be much less complicated than had been widely supposed.
The positioning of nucleosomes in vivo
Some properties of strongly pro- and antinucleosomal DNA sequences had already been elaborated in the pre-genomic era, but given the limited DNA sequencing capacity available, the extent to which genomic sequence programmed chromatin structure in vivo was unknown. Nucleosome positioning at any given locus can be ascribed to either local cis sequence cues or trans-acting protein factors (or, of course, both). Chromatin-remodeling complexes can move or evict nucleosomes, providing the canonical examples of trans-acting factors. Conversely, it has been known for decades that there is at least some variation between DNA sequences in their affinity for the histone octamer. The basic insight that led to this realization came originally from the observation that some DNA sequences were more or less flexible. Because DNA is sharply bent around the histone octamer, stiff sequences should be less favorable for nucleosomal incorporation, whereas flexible sequences or intrinsically curved sequences would be more favorable sites for octamer placement. In early studies, polyA sequences were shown to be intrinsically stiff, apparently owing to systems of 'bifurcated' hydrogen bonds between a given A and two Ts on the opposite strand. Conversely, because AT dinucleotides potentially introduce a kink in DNA, spacing of AT dinucleotides every 10 bp would be expected to result in DNA with a consistent curvature, reducing the free-energy cost of bending these sequences and resulting in more thermodynamically stable nucleosomes.
Since these early observations, the extent to which genomic sequence directs chromatin structure through intrinsic preferences has been an active area of investigation. One approach involved investigations in vivo - seminal studies from the Struhl group in S. cerevisiae showed that nucleosome depletion at the HIS3 promoter could be enhanced or diminished by adding or removing polyA sequences. However, in vivo studies are subject to the criticism that it is nearly impossible to exclude possible effects of an unknown trans-acting polyA-binding protein, although in S. cerevisiae this appears to be unlikely to account for the observations.
A general way to demonstrate that a given sequence intrinsically favors or disfavors nucleosome incorporation is to carry out in vitro nucleosome-reconstitution assays using nothing but histones, DNA and buffer. For instance, Struhl and colleagues showed that nucleosome depletion at the HIS3 promoter can be recapitulated in vitro, while Korber and colleagues showed that the PHO5 promoter can only be assembled into its in vivo packaging in the presence of yeast extract. Wide-ranging studies from various groups over decades have provided a great deal of insight into the rules underlying histone-DNA interactions. For example, selections for tight-binding sequences provided the chromatin community with the best-defined 'pronucleosomal' sequence, the 'Widom601' sequence (identified by Jon Widom and colleagues), which has been used in countless in vitro studies.
The subsequent sequencing of numerous genomes and the advent of genomic nucleosome maps provided fodder for a range of computational studies (reviewed in [1-3]). Initial studies focused on pronucleosomal sequences with a 10-nucleotide periodicity of AT, AA, or TT dinucleotides. Two early studies agreed that such sequences were enriched at the +1 nucleosome position, but these studies did not capture the dominant feature of yeast promoters - nucleosome depletion at the so-called nucleosome-free region. Subsequent studies from many groups improved on these models by systematically incorporating antinucleosomal sequences (such as polyA and others) that are prevalent at yeast promoters and appear to be a major determinant of nucleosome-free regions in vivo.
In vitro reconstitution studies reveal 'programmed' nucleosome-free regions
All of the studies noted above focused on predicting in vivo nucleosome positions, since in vitro reconstitution data were sparse. More recently, two groups have carried out genome-wide experimental studies of intrinsic nucleosome-binding preferences [10,11]. These studies differed in their conclusions, but the data are quite similar. In essence, in vitro reconstitution of yeast genomic DNA into nucleosomes captures nucleosome depletion at yeast promoters, but little else (Figure 1). The periodic spacing of AA/AT/TT dinucleotides that is statistically enriched at the in vivo +1 position does not appear to play a general role in positioning the +1 nucleosome, and more probably 'fine-tunes' rotational positioning of nucleosomes (that is, positioning to ±1 nucleotide after large-scale ±5 nucleotide positioning has been established by other means).
Figure 1. In vitro reconstitutions highlight yeast promoter nucleosome depletion. In vitro reconstitution data from Kaplan et al. are shown in pink; data from our own in vivo nucleosome mapping are in blue for comparison. Deep sequencing reads were mapped to the S. cerevisiae genome and extended to 140 bp (so each short read was extended to nucleosome length). Data were normalized for sequencing depth, and data for around 5,000 genes with well-defined transcriptional start sites (TSSs) were aligned and averaged over all genes for each dataset. Notably, the nucleosome depletion at yeast promoters is visible as a prominent valley in both datasets, whereas the stereotyped positioning of the +1 nucleosome relative to the transcription start site is clearly visible as a prominent peak only in the in vivo data. Red and green rectangles indicate regions previously proposed to be enriched for anti- and pronucleosomal sequences, respectively.
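The processing described in the caption (extend reads to nucleosome length, normalize for depth, align at the TSS and average) can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for Figure 1: the function names, the strand handling, the choice to normalize by scaling mean coverage to 1, and the window sizes are all assumptions made for the example.

```python
def extend_reads(reads, chrom_len, frag_len=140):
    """Build per-base coverage from reads given as (five_prime_pos, strand).

    Each read is extended to frag_len in its 3' direction (assumed convention).
    """
    cov = [0.0] * chrom_len
    for pos, strand in reads:
        if strand == "+":
            lo, hi = pos, pos + frag_len
        else:
            lo, hi = pos - frag_len + 1, pos + 1
        for i in range(max(lo, 0), min(hi, chrom_len)):
            cov[i] += 1.0
    return cov


def normalize_depth(cov):
    """Scale coverage so that its mean is 1 (one possible depth normalization)."""
    mean = sum(cov) / len(cov)
    return [c / mean for c in cov] if mean > 0 else list(cov)


def tss_profile(cov, genes, up=500, down=1000):
    """Average coverage around TSSs; genes are (tss, strand) pairs.

    Minus-strand genes are flipped so that index `up` is always the TSS and
    larger indices are always downstream.
    """
    width = up + down + 1
    total = [0.0] * width
    n = 0
    for tss, strand in genes:
        if strand == "+":
            idx = range(tss - up, tss + down + 1)
        else:
            idx = range(tss + up, tss - down - 1, -1)
        if min(idx[0], idx[-1]) < 0 or max(idx[0], idx[-1]) >= len(cov):
            continue  # skip genes whose window runs off the sequence
        for j, i in enumerate(idx):
            total[j] += cov[i]
        n += 1
    return [t / n for t in total] if n else total


# Toy usage: two reads on a 200-bp sequence, then a tiny TSS-aligned average.
toy_cov = extend_reads([(10, "+"), (150, "-")], 200)
toy_norm = normalize_depth(toy_cov)
toy_profile = tss_profile([0, 1, 2, 3, 4, 5, 6], [(3, "+"), (3, "-")], up=2, down=2)
```

In a profile built this way, promoter depletion would appear as a valley just upstream of index `up`, and a well-positioned +1 nucleosome as a peak just downstream of it.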
Kaplan et al. argue on the basis of a high (around 0.74) correlation coefficient between the in vitro and in vivo datasets that in vitro reconstitutions globally capture in vivo chromatin architecture. However, as Stein et al. recently showed, the use of correlation coefficients is misleading because they are subject to the 'influential point effect' - in other words, outlying points drive correlations (even if the bulk of the data are uncorrelated), and in the case of chromatin structure these outlying points correspond to the dramatic nucleosome depletion at promoters. Indeed, Zhang et al. showed very poor correspondence between in vivo nucleosome positioning and in vitro reconstitution data. Thus, we believe that all extant data support a view in which very little nucleosome positioning information is intrinsically encoded but that the yeast genome does program nucleosome depletion at promoters via antinucleosomal sequences. Below, we will discuss sequence models that attempt to recapitulate the in vitro reconstitution results, but readers should bear in mind that nucleosome occupancy rather than nucleosome positioning is being addressed when correlation coefficients are being used as the summary statistic.
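The influential point effect can be illustrated with a small simulation. The sketch below uses entirely synthetic numbers (they are not drawn from any of the datasets discussed): a large 'bulk' of positions whose occupancy in two datasets is independent, plus a small set of positions strongly depleted in both, standing in for promoters.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rng = random.Random(0)

# Bulk: 1,000 positions whose occupancies in datasets A and B are independent.
bulk_a = [rng.gauss(1.0, 0.1) for _ in range(1000)]
bulk_b = [rng.gauss(1.0, 0.1) for _ in range(1000)]

# Outliers: 50 positions strongly depleted in both datasets.
dep_a = [rng.gauss(0.2, 0.05) for _ in range(50)]
dep_b = [rng.gauss(0.2, 0.05) for _ in range(50)]

r_bulk = pearson(bulk_a, bulk_b)                 # bulk alone: close to zero
r_all = pearson(bulk_a + dep_a, bulk_b + dep_b)  # with outliers: large
```

Under these assumed parameters, fewer than 5% of the points are enough to turn an essentially uncorrelated pair of datasets into a strongly correlated one, which is why a high genome-wide correlation says little about agreement outside the depleted regions.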
Trimming the fat from computational models
In addition to establishing that intrinsic 'programming' of chromatin architecture is largely limited to nucleosome depletion at promoters, in vitro reconstitution datasets also enable more direct testing of computational models of intrinsic sequence preferences for nucleosomes. For instance, Kaplan et al. generated a model to predict the intrinsic preference of a given 147-bp sequence for nucleosomes. This model consists of a position-independent component (that is, a component that does not depend on location within the 147-bp sequence), and a position-dependent component. The position-independent component was based on measured occupancy of all 5-mer (5-nucleotide) sequences - some 5-mers, such as AAAAA, are rarely found in nucleosomes and thus a sequence carrying AAAAA would be weighted unfavorably for nucleosome formation. The position-dependent term builds on the dinucleotide periodicity noted above: for each position relative to the dyad axis, the frequency of each dinucleotide was calculated from reconstitution data, and then 147-bp sequences were scored for their match to this distribution. Predictions of this model correlated very well (0.89) with in vitro reconstitution data, suggesting that the majority of the affinity of a given sequence for the histone octamer is predictable from sequence (again, note that the use of correlation coefficients emphasizes outlier sequences, such as the polyAs found at promoters).
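The two-component structure described above can be sketched in a few lines. This is only a schematic of the idea, not the Kaplan et al. implementation: the weights and frequency tables below are invented toy values, and combining the two terms by adding log-scores is an assumption made for illustration.

```python
import math


def kmer_term(seq, kmer_log_weight, k=5, default=0.0):
    """Position-independent term: sum a log-weight over every k-mer in seq."""
    return sum(
        kmer_log_weight.get(seq[i:i + k], default)
        for i in range(len(seq) - k + 1)
    )


def positional_term(seq, dinuc_freq, background=1.0 / 16):
    """Position-dependent term: log-odds of the dinucleotide at each offset.

    dinuc_freq[i] maps a dinucleotide to its (hypothetical) frequency at
    offset i; dinucleotides missing from a table are treated as background.
    """
    score = 0.0
    for i, table in enumerate(dinuc_freq):
        if i + 2 > len(seq):
            break
        p = table.get(seq[i:i + 2], background)
        score += math.log(p / background)
    return score


def nucleosome_score(seq, kmer_log_weight, dinuc_freq):
    """Toy combined score: higher means more favorable for a nucleosome."""
    return kmer_term(seq, kmer_log_weight) + positional_term(seq, dinuc_freq)


# Invented toy parameters: AAAAA is penalized, AA is enriched at offset 0.
toy_kmer_weights = {"AAAAA": -2.0}
toy_dinuc_freq = [{"AA": 0.125}]

score_polya = nucleosome_score("AAAAAAA", toy_kmer_weights, toy_dinuc_freq)
score_gc = nucleosome_score("GCGCGCG", toy_kmer_weights, toy_dinuc_freq)
```

A real model of this kind would need one weight per 5-mer and one dinucleotide table per position, which is where the large parameter count comes from.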
The new work by Tillo and Hughes now extends the search for sequence rules underlying nucleosome occupancy. Given the large number (more than 2,000) of parameters included in the Kaplan model (all 5-mers, plus dinucleotide frequencies across 127 bp), Tillo and Hughes asked whether a simpler model might capture most of the occupancy information encoded in DNA sequence. They used a linear regression algorithm called Lasso to identify features that predict nucleosome occupancy in the Kaplan dataset. Specifically, after selecting a large number of candidate features (straightforward candidate features such as %GC content, not complex eigenvectors such as generated by principal component analysis), Lasso creates a linear combination model with an emphasis on setting as many coefficients to zero as possible. The resulting model(s) had very few parameters, with the model selected for study having only 14 features.
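To make the sparsity property concrete, here is a minimal Lasso sketch using cyclic coordinate descent with soft-thresholding. It is not the software or settings used by Tillo and Hughes; the objective (1/2n)·||y − Xβ||² + λ·||β||₁, the absence of an intercept, the penalty value, and the synthetic data are all assumptions for illustration.

```python
import random


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the Lasso (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, starting from beta = 0
    z = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue
            # Correlation of feature j with the residual excluding feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / z[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Synthetic example: six candidate features, only the first two matter.
rng = random.Random(1)
n, p = 200, 6
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]

beta = lasso(X, y, lam=0.1)
```

In this toy setting the coefficients of the four irrelevant features are driven to exactly zero while the two informative ones survive (slightly shrunk), which is the behavior that lets Lasso reduce a large candidate feature set to a handful of terms.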
The resulting sequence model captures in vitro nucleosome occupancy data nearly as well (R = 0.86) as the Kaplan model (R = 0.89), indicating that a small number of sequence features describes most of the variation in nucleosome occupancy in in vitro reconstitutions. Close examination of the most important features of this model indicates that %GC and polyA runs are the two dominant factors, with a simple model using just these features exhibiting a correlation of 0.72 with in vitro data. Indeed, a model based on %GC alone showed a correlation of 0.71 with the in vitro data. Much of this is likely to be a consequence of the fact that many of the other 13 parameters are correlated with %GC - AAAA is obviously unlikely in high-GC sequences. Furthermore, many features of DNA three-dimensional structure (the authors specifically note 'propeller twist' and 'slide') are also correlated with %GC, and thus GC content seems to provide a single feature that captures many related structural characteristics that are important for nucleosome stability. Additional features in the model (AAAA, propeller twist, and so on) are then proposed to indicate features that are important for nucleosome formation but are not entirely captured by GC content. It is important to consider that there may also be a confounding effect of genome structure - yeast promoters are AT-rich and nucleosome-depleted, so %GC will naturally correlate with nucleosome depletion whether it is a cause or a consequence - but the authors also test their model against reconstitution data from synthetic DNA and still obtain significant correlations with the in vitro data.
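The two dominant features are simple to compute from sequence. The sketch below shows one way to do so; the function names are hypothetical, and the 147-bp default window is an assumption based on nucleosome length rather than the window actually used by Tillo and Hughes.

```python
def gc_fraction(seq):
    """Fraction of G or C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(1 for b in seq if b in "GC") / len(seq) if seq else 0.0


def sliding_gc(seq, window=147):
    """GC fraction of every full window along seq, using a running count."""
    seq = seq.upper()
    if len(seq) < window:
        return []
    is_gc = [1 if b in "GC" else 0 for b in seq]
    count = sum(is_gc[:window])
    out = [count / window]
    for i in range(window, len(seq)):
        count += is_gc[i] - is_gc[i - window]
        out.append(count / window)
    return out


def longest_at_homopolymer(seq):
    """Length of the longest run of A or of T (a polyA tract on either strand)."""
    best = run = 0
    prev = ""
    for b in seq.upper():
        if b in "AT" and b == prev:
            run += 1
        elif b in "AT":
            run = 1
        else:
            run = 0
        best = max(best, run)
        prev = b
    return best


# Toy usage on short sequences.
gc_windows = sliding_gc("GGCCAATT", window=4)
longest_run = longest_at_homopolymer("GGAAAAATTTTCC")
```

Because long A or T runs necessarily lower the local GC fraction, the two features are correlated by construction, consistent with the observation that adding polyA to a %GC-only model improves the correlation only slightly.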
These results have a number of important implications for thinking about how chromatin structure is 'programmed'. First, as there are no terms for dinucleotide periodicity in the Lasso model, these results support the finding of many groups arguing that nucleosome exclusion by polyA and related sequences is the dominant feature in in vitro nucleosome-reconstitution assays. Second, the lack of support for 'pronucleosomal' sequences as major positioning cues in the reconstitution data re-emphasizes the status of statistical positioning as the best hypothesis to explain why chromatin is so well ordered in vivo. Third, because the Tillo and Hughes model also performs reasonably well on nucleosome-mapping data from Caenorhabditis elegans, it may prove portable for analysis of genomes other than yeast. Finally, these results have important implications for genome structure and evolution, as GC content varies between organisms and across genomes (CpG islands being a prominent example).
The recent lively interest in the idea of a 'nucleosome code' that might program the packaging of the genome thus seems somewhat excessive. Tillo and Hughes help clear the air to some extent, showing that simple models very effectively capture the majority of the behavior of in vitro nucleosome-reconstitution experiments. Programming nucleosome depletion with AT-rich sequences at promoters is confirmed as a key regulatory strategy in budding yeast and perhaps C. elegans. These insights may help guide questions about the evolution of chromatin packaging at specific loci, and about the regulatory strategies available to promoters with large nucleosome-free regions 'programmed' in cis.
We thank M Radman-Livaja and N Friedman for critical comments on this minireview. OJR is supported in part by a Career Award in the Biomedical Sciences from the Burroughs Wellcome Fund, and by grants from NIGMS and HFSP.
Dev Biol 2009.
doi:10.1016/j.ydbio.2009.06.012
Nucleic Acids Res 2009.
Genome Res 2009.
doi:10.1101/gr.098509.109