It's of huge value to be able to predict CRISPR target efficiency ahead of time. Xu et al have published an analysis of multiple guide RNA data sets and extracted what they claim is an improved model for predicting target cleavage efficiency. These data are all for the native S. pyogenes Cas9.
Xu et al. Sequence determinants of improved CRISPR sgRNA design.
Genome Res. 2015 Aug;25(8):1147-57. doi: 10.1101/gr.191452.115. Epub 2015 Jun 10.
Their paper is important to me for several reasons. First, they examined two independently published "large" guide RNA data sets that had mutagenesis-efficiency data, which gives more confidence that the sequence-preference trends hold up across labs and platforms. Second, they validated their predictive model on a small (in comparison to genome-wide, but still not bad) data set of new CRISPR targets and corresponding guide RNAs. Third, they did "in silico validation" by turning their model loose on another target/indel data set, and showed improved performance of their predictive model over a previously published model. See the ROC curves in Fig. 4b. The model makes it possible to weed out "50-60% of the inefficient sgRNAs…at the cost of 10-20% of efficient sgRNAs misclassified." That is, misclassified as inefficient.
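To put those percentages in concrete terms, here is a small back-of-the-envelope sketch. The library size and the starting fraction of inefficient guides are numbers I made up for illustration (they are not from Xu et al), and the removal rates are just the midpoints of the quoted ranges.

```python
def filter_outcome(n_guides, frac_inefficient,
                   frac_inefficient_removed, frac_efficient_removed):
    """Return (efficient_kept, inefficient_kept, efficient_fraction_after).

    All inputs are assumptions supplied by the caller; nothing here is
    taken from the published model itself.
    """
    inefficient = n_guides * frac_inefficient
    efficient = n_guides - inefficient
    inefficient_kept = inefficient * (1 - frac_inefficient_removed)
    efficient_kept = efficient * (1 - frac_efficient_removed)
    total_kept = efficient_kept + inefficient_kept
    return efficient_kept, inefficient_kept, efficient_kept / total_kept


# Hypothetical library: 1000 guides, 40% of them assumed inefficient.
# Filter removes 55% of inefficient guides and 15% of efficient ones.
eff, ineff, frac = filter_outcome(1000, 0.40, 0.55, 0.15)
print(f"efficient kept: {eff:.0f}, inefficient kept: {ineff:.0f}, "
      f"efficient fraction after filtering: {frac:.2f}")
```

Under those made-up starting numbers, the share of efficient guides goes from 60% before filtering to roughly three quarters after. The real gain depends entirely on how many inefficient guides you start with.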
For those who are interested in genome-wide knockout screening experiments, these sorts of models are very useful for making the screens more efficient. Moreover, if you wish to knock out particular genes, a model like this should let you test or use fewer targets per gene until you find one that works well.
OK, now the sobering reality for nerds like me is that predictive models, even with great ROC curves, have false positive and false negative rates that will bite you in the behind on a regular basis if you are designing large projects around the function of single CRISPR targets. I'm still facing this issue for precision knock-in projects, for which there are often not many targets to choose from. And with transgenic mice we always want the efficiency as high as possible. For cell lines, hey, that's not as much a problem if you can subclone the edited lines.
But let's get back to the CRISPR target sequence preferences. The bottom line here is that the last three bases of the protospacer seem to have the most influence on cleavage efficiency, with a C preferred at the -3 position (relative to the PAM), and G's at -2 and -1. Also, G's are helpful at the -17 to -14 region, while A's are good at the -12 to -9 region. Finally, a C seems helpful at +1 following the PAM.
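As a rough illustration, the preferences listed above can be tallied for a candidate site. To be clear, this is not the Xu et al model, which assigns weights that I am not reproducing here. It is just an unweighted count of how many of the twelve listed positions carry the preferred base. The 24-nt input layout (20-nt protospacer, 3-nt PAM, one downstream base) and the function names are my own assumptions.

```python
# Qualitative preferences from the paragraph above, keyed by position
# relative to the PAM: -20..-1 is the protospacer, +1 is the first base
# after the PAM. Unweighted, for illustration only.
PREFERENCES = {-3: "C", -2: "G", -1: "G", +1: "C"}
PREFERENCES.update({pos: "G" for pos in range(-17, -13)})  # -17 to -14
PREFERENCES.update({pos: "A" for pos in range(-12, -8)})   # -12 to -9


def base_at(site, pos):
    """Return the base at a PAM-relative position.

    `site` is assumed to be 24 nt, written 5'->3':
    20-nt protospacer + 3-nt PAM + 1 downstream base.
    """
    if pos < 0:
        return site[20 + pos]  # -20 -> index 0, -1 -> index 19
    return site[22 + pos]      # +1 -> index 23 (indices 20-22 are the PAM)


def count_matches(site):
    """Count how many of the 12 listed positions hold the preferred base."""
    site = site.upper()
    if len(site) != 24:
        raise ValueError("expected 20-nt protospacer + 3-nt PAM + 1 base")
    return sum(base_at(site, pos) == base for pos, base in PREFERENCES.items())


# Made-up site built to match every listed preference.
ideal = "TTT" + "GGGG" + "T" + "AAAA" + "TTTTT" + "CGG" + "TGG" + "C"
print(count_matches(ideal))
```

A count like this treats every position as equally important, which the paragraph above says is not true (the last three bases matter most), so it is only good for a quick sanity check.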
Looking back at the Wang et al paper, they also reported a preference for A's at around -10 to -8, and essentially a "GCRR" preference for bases -4 to -1. This makes sense, since Xu et al based their model partly on the Wang data. However, Xu et al point out that the apparent G preference at the -20 position is probably an artifact of the Wang sgRNA library: those guides may have been more efficient because of enhanced transcription, not activity per se.
General GC-richness in the protospacer is known to correlate with CRISPR mutagenesis. Could that just be driven by the GC-rich preferences of the last few bases? Otherwise, GC-richness doesn't clearly emerge from the Xu model, at least not to me. I took a crack at this by looking at a data set from Gagnon et al, mostly because I could handle the size of their sgRNA list in an Excel spreadsheet without exploding my own brain or my iMac. My impression is that GC-richness is still "good" even when the last 4 bases of the protospacer are similar.
Here's an example. From Gagnon et al's list of 122 sgRNAs with indel numbers, I ranked them according to how well they matched the "GCRR" of the last four bases. I based this on the Wang et al paper, although I think it is very similar to the corresponding part of the Xu model. My "score" ranged from 0 to 7. Then I examined the subset of 30 targets that all had the same "score" of 5. So these targets are all controlled, at least kinda sorta, for their variation in bases -4 to -1, in that they match the "GCRR" motif about equally well. Finally, I graphed the indel frequencies versus the GC content of the first 16 bases of their protospacers.
Here is the data. y axis = indel frequency (in a zebrafish model); x axis = number of G or C bases in the first 16 bases of the protospacer.
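For anyone who would rather script this than do it in a spreadsheet, here is a minimal sketch of the bookkeeping. One big caveat: my 0 to 7 score used a weighting I haven't spelled out here, so the sketch swaps in a plain 0 to 4 count of matches to "GCRR" (R = A or G). The function names are hypothetical and the test sequence is made up, not taken from the Gagnon et al list.

```python
from collections import defaultdict

PURINES = {"A", "G"}  # IUPAC code R


def gcrr_matches(protospacer):
    """Count matches (0-4) of bases -4..-1 to the G-C-R-R motif.

    Simple unweighted stand-in, NOT the 0-7 score described in the post.
    """
    g, c, r1, r2 = protospacer.upper()[-4:]
    return (g == "G") + (c == "C") + (r1 in PURINES) + (r2 in PURINES)


def gc_first16(protospacer):
    """Number of G or C bases in the first 16 bases (positions -20..-5)."""
    return sum(base in "GC" for base in protospacer.upper()[:16])


def group_by_motif_match(protospacers):
    """Group 20-nt protospacers by motif match count.

    Returns {match_count: [(protospacer, gc_first16), ...]} so that GC
    content can be compared among targets with similar 3' ends.
    """
    groups = defaultdict(list)
    for p in protospacers:
        groups[gcrr_matches(p)].append((p, gc_first16(p)))
    return dict(groups)


# Made-up 20-nt protospacer, for illustration only.
example = "ATATATATGCGCGCGC" + "GCAT"
print(gcrr_matches(example), gc_first16(example))
```

Since the first 16 bases and the last 4 bases don't overlap in a 20-nt protospacer, the two numbers are computed from separate parts of the target. That is the point of the comparison: hold the 3' end roughly constant, then look at GC content in the rest.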
This ain't close to something I'd submit for peer review, but I do see a trend. GC-richness in the first 16 bases of the protospacer correlates with cleavage efficiency, even within a group of targets for which the 3' ends are similar. So for now, I will continue to prefer overall GC-rich targets that also have at least some matching to the "CGG", or "GCRR", motif at the very 3' end.
Also, "CGG" matches the high-efficiency 3' end reported by Farboud and Meyer so there's another corroboration.
So, the answer to my previous post "Are there sequence preferences near the 3' end of the #CRISPR protospacer? …" is yes. And this holds up for S. pyogenes Cas9 when used across human, mouse, fish and C. elegans models.
A final note: these data all refer to cleavage and/or knockout efficiencies. CRISPRi and CRISPRa screens, which do not lead to or require DNA cleavage, have different sequence preferences, which Xu et al also modeled in detail.