Although ultrahigh-throughput RNA-sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are now generating ever-larger data sets comprising RNA-seq data from hundreds or thousands of samples, often collected at multiple locations and from diverse tissues. We examine the effects of different preprocessing methods on downstream analyses. We find that the analysis of large RNA-seq data sets requires careful quality control and that one must account for the sparsity arising from the heterogeneity intrinsic to multi-group studies. We illustrate our results using the GTEx cohort and examine the pathways that differ between cell lines and their progenitor tissues.