//! A `Source` for registry-based packages.
//!
//! # What's a Registry?
//!
//! Registries are central locations where packages can be uploaded,
//! discovered, and searched for. The purpose of a registry is to have a
//! location that serves as permanent storage for versions of a crate over time.
//!
//! Compared to git sources, a registry provides many packages as well as many
//! versions simultaneously. Git sources can also have commits deleted through
//! rebasing, whereas registries cannot have their versions deleted.
//!
//! # The Index of a Registry
//!
//! One of the major difficulties with a registry is that hosting so many
//! packages may quickly run into performance problems when dealing with
//! dependency graphs. It's infeasible for cargo to download the entire contents
//! of the registry just to resolve one package's dependencies, for example. As
//! a result, cargo needs some efficient method of querying what packages are
//! available on a registry, what versions are available, and what the
//! dependencies for each version are.
//!
//! One method of doing so would be having the registry expose an HTTP endpoint
//! which can be queried with a list of packages and which returns their
//! dependencies and versions. This is somewhat inefficient, however, as we may
//! have to hit the endpoint many times and we may already have much of the
//! data locally (from queries for other packages, for example). This also
//! involves inventing a transport format between the registry and Cargo
//! itself, so this route was not taken.
//!
//! Instead, Cargo communicates with registries through a git repository
//! referred to as the Index. The Index of a registry is essentially an easily
//! queryable version of the registry's database for a list of versions of a
//! package as well as a list of dependencies for each version.
//!
//! Using git to host this index provides a number of benefits:
//!
//! * The entire index can be stored efficiently locally on disk. This means
//!   that all queries of a registry can happen locally and don't need to touch
//!   the network.
//!
//! * Updates of the index are quite efficient. Using git buys incremental
//!   updates, compressed transmission, etc. for free. The index must be updated
//!   each time we need fresh information from a registry, but this is one
//!   update of a git repository that probably hasn't changed a whole lot, so
//!   it shouldn't be too expensive.
//!
//!   Additionally, each modification to the index is just appending a line at
//!   the end of a file (the exact format is described later). This means that
//!   the commits for an index are quite small and easily applied/compressible.
//!
//! ## The format of the Index
//!
//! The index is a store for the list of versions for all packages known, so its
//! format on disk is optimized slightly to ensure that `ls registry` doesn't
//! produce a list of all packages ever known. The index also wants to ensure
//! that there aren't a million files, which may actually end up hitting
//! filesystem limits at some point. To this end, a few decisions were made
//! about the format of the registry:
//!
//! 1. Each crate will have one file corresponding to it. Each version for a
//!    crate will just be a line in this file.
//! 2. There will be two tiers of directories for crate names, under which
//!    crates corresponding to those tiers will be located.
//!
//! For example, the hierarchy of an index might look like this:
//!
//! ```notrust
//! .
//! ├── 3
//! │   └── u
//! │       └── url
//! ├── bz
//! │   └── ip
//! │       └── bzip2
//! ├── config.json
//! ├── en
//! │   └── co
//! │       └── encoding
//! └── li
//!     ├── bg
//!     │   └── libgit2
//!     └── nk
//!         └── link-config
//! ```
//!
//! The root of the index contains a `config.json` file with a few entries
//! corresponding to the registry (see `RegistryConfig` below).
//!
//! Besides that, there are three numbered directories (1, 2, 3) for crates with
//! names 1, 2, and 3 characters in length. The 1/2 directories simply have the
//! crate files underneath them, while the 3 directory is sharded by the first
//! letter of the crate name.
//!
//! For longer names, the top-level directory contains many two-letter
//! directories (the first two letters of the crate name), each of which has
//! many sub-folders named after the next two letters. At the end of all these
//! are the actual crate files themselves.
//!
//! The purpose of this layout is to cut down on `ls` sizes as well as to allow
//! efficient lookup based on the crate name itself.
//!
//! ## Crate files
//!
//! Each file in the index is the history of one crate over time. Each line in
//! the file corresponds to one version of a crate, stored in JSON format (see
//! the `RegistryPackage` structure below).
//!
//! As new versions are published, new lines are appended to this file. The only
//! modifications to this file that should happen over time are yanks of a
//! particular version.
//!
//! # Downloading Packages
//!
//! The purpose of the Index was to provide an efficient method to resolve the
//! dependency graph for a package. So far we only required one network
//! interaction to update the registry's repository (yay!). After resolution has
//! been performed, however, we need to download the contents of packages so we
//! can read the full manifest and build the source code.
//!
//! To accomplish this, this source's `download` method will make one HTTP
//! request per requested package to download tarballs into a local cache. These
//! tarballs will then be unpacked into a destination folder.
//!
//! Note that because versions uploaded to the registry are frozen forever, the
//! HTTP download and unpacking can all be skipped if the version has
//! already been downloaded and unpacked. This caching allows us to only
//! download a package when absolutely necessary.
//!
//! # Filesystem Hierarchy
//!
//! Overall, the `$HOME/.cargo` directory looks like this when talking about
//! the registry:
//!
//! ```notrust
//! # A folder under which all registry metadata is hosted (similar to
//! # $HOME/.cargo/git)
//! $HOME/.cargo/registry/
//!
//!     # For each registry that cargo knows about (keyed by hostname + hash)
//!     # there is a folder which is the checked out version of the index for
//!     # the registry in this location. Note that this is done so cargo can
//!     # support multiple registries simultaneously
//!     index/
//!         registry1-<hash>/
//!         registry2-<hash>/
//!         ...
//!
//!     # This folder is a cache for all downloaded tarballs from a registry.
//!     # Once downloaded and verified, a tarball never changes.
//!     cache/
//!         registry1-<hash>/<pkg>-<version>.crate
//!         ...
//!
//!     # Location in which all tarballs are unpacked. Each tarball is known to
//!     # be frozen after downloading, so transitively this folder is also
//!     # frozen once it's unpacked (it's never unpacked again)
//!     src/
//!         registry1-<hash>/<pkg>-<version>/...
//!         ...
//! ```

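/*
The two-tier index layout described in the module documentation above can be
sketched as a small path computation. This is only an illustration under
stated assumptions, not the lookup code used by this module: the helper name
`index_file_path` is hypothetical, crate names are assumed to be ASCII, and no
case normalization is attempted.

```rust
/// Hypothetical helper: relative path of a crate's file inside the index,
/// following the layout described in the module docs.
/// Returns `None` for empty or non-ASCII names, which this sketch does not handle.
fn index_file_path(name: &str) -> Option<String> {
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    let path = match name.len() {
        // One- and two-character names live directly under `1/` and `2/`.
        1 => format!("1/{}", name),
        2 => format!("2/{}", name),
        // Three-character names are sharded by their first letter.
        3 => format!("3/{}/{}", &name[..1], name),
        // Longer names use two tiers of two-letter directories.
        _ => format!("{}/{}/{}", &name[..2], &name[2..4], name),
    };
    Some(path)
}
```

With the example hierarchy from the docs, `index_file_path("url")` gives
`3/u/url` and `index_file_path("libgit2")` gives `li/bg/libgit2`.
*/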
use std::borrow::Cow;
use std::collections::BTreeMap;
use std::collections::HashSet;
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::{Path, PathBuf};

use flate2::read::GzDecoder;
use log::debug;
use semver::{Version, VersionReq};
use serde::Deserialize;
use tar::Archive;

use crate::core::dependency::{DepKind, Dependency};
use crate::core::source::MaybePackage;
use crate::core::{InternedString, Package, PackageId, Source, SourceId, Summary};
use crate::sources::PathSource;
use crate::util::errors::CargoResultExt;
use crate::util::hex;
use crate::util::into_url::IntoUrl;
use crate::util::{CargoResult, Config, Filesystem};

const PACKAGE_SOURCE_LOCK: &str = ".cargo-ok";
pub const CRATES_IO_INDEX: &str = "https://github.com/rust-lang/crates.io-index";
pub const CRATES_IO_REGISTRY: &str = "crates-io";
const CRATE_TEMPLATE: &str = "{crate}";
const VERSION_TEMPLATE: &str = "{version}";

pub struct RegistrySource<'cfg> {
    source_id: SourceId,
    src_path: Filesystem,
    config: &'cfg Config,
    updated: bool,
    ops: Box<dyn RegistryData + 'cfg>,
    index: index::RegistryIndex<'cfg>,
    yanked_whitelist: HashSet<PackageId>,
}

#[derive(Deserialize)]
pub struct RegistryConfig {
    /// Download endpoint for all crates.
    ///
    /// The string is a template which will generate the download URL for the
    /// tarball of a specific version of a crate. The substrings `{crate}` and
    /// `{version}` will be replaced with the crate's name and version
    /// respectively.
    ///
    /// For backwards compatibility, if the string does not contain `{crate}` or
    /// `{version}`, it will be extended with `/{crate}/{version}/download` to
    /// support registries like crates.io which were created before the
    /// templating setup was created.
    pub dl: String,

    /// API endpoint for the registry. This is what's actually hit to perform
    /// operations like yanks, owner modifications, publish new crates, etc.
    /// If this is None, the registry does not support API commands.
    pub api: Option<String>,
}
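
/*
The `dl` template rules documented on `RegistryConfig` above can be sketched as
follows. This is a minimal illustration, not the code that performs the
expansion in Cargo: the helper name `expand_dl_template` and the example URLs
are assumptions made for this sketch.

```rust
/// Hypothetical helper: expands a `dl` template into a download URL.
fn expand_dl_template(dl: &str, name: &str, version: &str) -> String {
    let mut url = dl.to_string();
    // Backwards compatibility: a template with neither marker is treated as a
    // base URL and extended with the default suffix.
    if !url.contains("{crate}") && !url.contains("{version}") {
        url.push_str("/{crate}/{version}/download");
    }
    // Substitute the markers with the crate's name and version.
    url.replace("{crate}", name).replace("{version}", version)
}
```

Under these assumptions, a plain base such as `https://example.com/api/v1/crates`
expands to `https://example.com/api/v1/crates/serde/1.0.0/download` for
`serde` 1.0.0, while `https://example.com/{crate}-{version}.crate` expands to
`https://example.com/serde-1.0.0.crate`.
*/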

/// A single line in the index representing a single version of a package.
#[derive(Deserialize)]
pub struct RegistryPackage<'a> {
    name: InternedString,
    vers: Version,
    #[serde(borrow)]
    deps: Vec<RegistryDependency<'a>>,
    features: BTreeMap<InternedString, Vec<InternedString>>,
    cksum: String,
    /// If `true`, Cargo will skip this version when resolving.
    ///
    /// This was added in 2014. Everything in the crates.io index has this set
    /// now, so this probably doesn't need to be an option anymore.
    yanked: Option<bool>,
    /// Native library name this package links to.
    ///
    /// Added early 2018 (see https://github.com/rust-lang/cargo/pull/4978),
    /// can be `None` if published before then.
    links: Option<InternedString>,
}

#[test]
fn escaped_char_in_json() {
    let _: RegistryPackage<'_> = serde_json::from_str(
        r#"{"name":"a","vers":"0.0.1","deps":[],"cksum":"bae3","features":{}}"#,
    )
    .unwrap();
    let _: RegistryPackage<'_> = serde_json::from_str(
        r#"{"name":"a","vers":"0.0.1","deps":[],"cksum":"bae3","features":{"test":["k","q"]},"links":"a-sys"}"#
    ).unwrap();

    // Now we add escaped chars in all the places they can go.
    // These values are not valid, but they should only error later than JSON
    // parsing.
    let _: RegistryPackage<'_> = serde_json::from_str(
        r#"{
        "name":"This name has a escaped cher in it \n\t\" ",
        "vers":"0.0.1",
        "deps":[{
            "name": " \n\t\" ",
            "req": " \n\t\" ",
            "features": [" \n\t\" "],
            "optional": true,
            "default_features": true,
            "target": " \n\t\" ",
            "kind": " \n\t\" ",
            "registry": " \n\t\" "
        }],
        "cksum":"bae3",
        "features":{"test \n\t\" ":["k \n\t\" ","q \n\t\" "]},
        "links":" \n\t\" "}"#,
    )
    .unwrap();
}

#[derive(Deserialize)]
#[serde(field_identifier, rename_all = "lowercase")]
enum Field {
    Name,
    Vers,
    Deps,
    Features,
    Cksum,
    Yanked,
    Links,
}

#[derive(Deserialize)]
struct RegistryDependency<'a> {
    name: InternedString,
    #[serde(borrow)]
    req: Cow<'a, str>,
    features: Vec<InternedString>,
    optional: bool,
    default_features: bool,
    target: Option<Cow<'a, str>>,
    kind: Option<Cow<'a, str>>,
    registry: Option<Cow<'a, str>>,
    package: Option<InternedString>,
    public: Option<bool>,
}

impl<'a> RegistryDependency<'a> {
    /// Converts an encoded dependency in the registry to a cargo dependency
    pub fn into_dep(self, default: SourceId) -> CargoResult<Dependency> {
        let RegistryDependency {
            name,
            req,
            mut features,
            optional,
            default_features,
            target,
            kind,
            registry,
            package,
            public,
        } = self;

        let id = if let Some(registry) = &registry {
            SourceId::for_registry(&registry.into_url()?)?
        } else {
            default
        };

        let mut dep = Dependency::parse_no_deprecated(package.unwrap_or(name), Some(&req), id)?;
        if package.is_some() {
            dep.set_explicit_name_in_toml(name);
        }
        let kind = match kind.as_deref().unwrap_or("") {
            "dev" => DepKind::Development,
            "build" => DepKind::Build,
            _ => DepKind::Normal,
        };

        let platform = match target {
            Some(target) => Some(target.parse()?),
            None => None,
        };

        // All dependencies are private by default
        let public = public.unwrap_or(false);

        // Unfortunately older versions of cargo and/or the registry ended up
        // publishing lots of entries where the features array contained the
        // empty feature, "", inside. This confuses the resolution process much
        // later on and these features aren't actually valid, so filter them all
        // out here.
        features.retain(|s| !s.is_empty());

        // In the index, "registry" is null if it is from the same index.
        // In Cargo.toml, "registry" is None if it is from the default registry.
        if !id.is_default_registry() {
            dep.set_registry_id(id);
        }

        dep.set_optional(optional)
            .set_default_features(default_features)
            .set_features(features)
            .set_platform(platform)
            .set_kind(kind)
            .set_public(public);

        Ok(dep)
    }
}

pub trait RegistryData {
    fn prepare(&self) -> CargoResult<()>;
    fn index_path(&self) -> &Filesystem;
    fn load(
        &self,
        root: &Path,
        path: &Path,
        data: &mut dyn FnMut(&[u8]) -> CargoResult<()>,
    ) -> CargoResult<()>;
    fn config(&mut self) -> CargoResult<Option<RegistryConfig>>;
    fn update_index(&mut self) -> CargoResult<()>;
    fn download(&mut self, pkg: PackageId, checksum: &str) -> CargoResult<MaybeLock>;
    fn finish_download(&mut self, pkg: PackageId, checksum: &str, data: &[u8])
        -> CargoResult<File>;

    fn is_crate_downloaded(&self, _pkg: PackageId) -> bool {
        true
    }
    fn assert_index_locked<'a>(&self, path: &'a Filesystem) -> &'a Path;
    fn current_version(&self) -> Option<InternedString>;
}

/// The result of asking a `RegistryData` to download a crate.
pub enum MaybeLock {
    /// The crate tarball is already available as a local file.
    Ready(File),
    /// The crate tarball still has to be fetched from `url`.
    Download { url: String, descriptor: String },
}

mod index;
mod local;
mod remote;

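/// Returns the `<host>-<hash>` name under which a registry's data is stored
/// (see the filesystem hierarchy in the module docs).
fn short_name(id: SourceId) -> String {
    let hash = hex::short_hash(&id);
    let ident = id.url().host_str().unwrap_or("").to_string();
    format!("{}-{}", ident, hash)
}
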
impl<'cfg> RegistrySource<'cfg> {
    pub fn remote(
        source_id: SourceId,
        yanked_whitelist: &HashSet<PackageId>,
        config: &'cfg Config,
    ) -> RegistrySource<'cfg> {
        let name = short_name(source_id);
        let ops = remote::RemoteRegistry::new(source_id, config, &name);
        RegistrySource::new(source_id, config, &name, Box::new(ops), yanked_whitelist)
    }

    pub fn local(
        source_id: SourceId,
        path: &Path,
        yanked_whitelist: &HashSet<PackageId>,
        config: &'cfg Config,
    ) -> RegistrySource<'cfg> {
        let name = short_name(source_id);
        let ops = local::LocalRegistry::new(path, config, &name);
        RegistrySource::new(source_id, config, &name, Box::new(ops), yanked_whitelist)
    }

    fn new(
        source_id: SourceId,
        config: &'cfg Config,
        name: &str,
        ops: Box<dyn RegistryData + 'cfg>,
        yanked_whitelist: &HashSet<PackageId>,
    ) -> RegistrySource<'cfg> {
        RegistrySource {
            src_path: config.registry_source_path().join(name),
            config,
            source_id,
            updated: false,
            index: index::RegistryIndex::new(source_id, ops.index_path(), config),
            yanked_whitelist: yanked_whitelist.clone(),
            ops,
        }
    }

    /// Decode the configuration stored within the registry.
    ///
    /// This requires that the index has been at least checked out.
    pub fn config(&mut self) -> CargoResult<Option<RegistryConfig>> {
        self.ops.config()
    }

    /// Unpacks a downloaded package into a location where it's ready to be
    /// compiled.
    ///
    /// No action is taken if the source looks like it's already unpacked.
    fn unpack_package(&self, pkg: PackageId, tarball: &File) -> CargoResult<PathBuf> {
        // The `.cargo-ok` file is used to track if the source is already
        // unpacked.
        let package_dir = format!("{}-{}", pkg.name(), pkg.version());
        let dst = self.src_path.join(&package_dir);
        dst.create_dir()?;
        let path = dst.join(PACKAGE_SOURCE_LOCK);
        let path = self.config.assert_package_cache_locked(&path);
        let unpack_dir = path.parent().unwrap();
        if let Ok(meta) = path.metadata() {
            if meta.len() > 0 {
                return Ok(unpack_dir.to_path_buf());
            }
        }
        let mut ok = OpenOptions::new()
            .create(true)
            .read(true)
            .write(true)
            .open(&path)?;

        let gz = GzDecoder::new(tarball);
        let mut tar = Archive::new(gz);
        let prefix = unpack_dir.file_name().unwrap();
        let parent = unpack_dir.parent().unwrap();
        for entry in tar.entries()? {
            let mut entry = entry.chain_err(|| "failed to iterate over archive")?;
            let entry_path = entry
                .path()
                .chain_err(|| "failed to read entry path")?
                .into_owned();

            // We're going to unpack this tarball into the global source
            // directory, but we want to make sure that it doesn't accidentally
            // (or maliciously) overwrite source code from other crates. Cargo
            // itself should never generate a tarball that hits this error, and
            // crates.io should also block uploads with these sorts of tarballs,
            // but be extra sure by adding a check here as well.
            if !entry_path.starts_with(prefix) {
                anyhow::bail!(
                    "invalid tarball downloaded, contains \
                     a file at {:?} which isn't under {:?}",
                    entry_path,
                    prefix
                )
            }

            // Once that's verified, unpack the entry as usual.
            entry
                .unpack_in(parent)
                .chain_err(|| format!("failed to unpack entry at `{}`", entry_path.display()))?;
        }

        // Write to the lock file to indicate that unpacking was successful.
        write!(ok, "ok")?;

        Ok(unpack_dir.to_path_buf())
    }
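
/*
The prefix check in `unpack_package` above relies on `Path::starts_with`
comparing whole path components rather than raw string prefixes. A minimal
standalone sketch of that check follows; the helper name `check_entries` and
the example paths are hypothetical, and the sketch covers only the prefix
test, not the extraction itself.

```rust
use std::path::Path;

/// Hypothetical helper: rejects the first entry that does not live under the
/// `<pkg>-<version>` prefix directory.
fn check_entries(prefix: &str, entries: &[&str]) -> Result<(), String> {
    for entry in entries {
        // `starts_with` compares path components, so `foo-1.0.0-evil/x` is
        // not considered to be under `foo-1.0.0`.
        if !Path::new(entry).starts_with(prefix) {
            return Err(format!(
                "invalid tarball, contains a file at {:?} which isn't under {:?}",
                entry, prefix
            ));
        }
    }
    Ok(())
}
```

With prefix `foo-1.0.0`, the entries `foo-1.0.0/Cargo.toml` and
`foo-1.0.0/src/lib.rs` pass, while `foo-1.0.0-evil/x` or `/etc/passwd` are
rejected.
*/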

    fn do_update(&mut self) -> CargoResult<()> {
        self.ops.update_index()?;
        let path = self.ops.index_path();
        self.index = index::RegistryIndex::new(self.source_id, path, self.config);
        self.updated = true;
        Ok(())
    }

    fn get_pkg(&mut self, package: PackageId, path: &File) -> CargoResult<Package> {
        let path = self
            .unpack_package(package, path)
            .chain_err(|| format!("failed to unpack package `{}`", package))?;
        let mut src = PathSource::new(&path, self.source_id, self.config);
        src.update()?;
        let mut pkg = match src.download(package)? {
            MaybePackage::Ready(pkg) => pkg,
            MaybePackage::Download { .. } => unreachable!(),
        };

        // After we've loaded the package configure its summary's `checksum`
        // field with the checksum we know for this `PackageId`.
        let req = VersionReq::exact(package.version());
        let summary_with_cksum = self
            .index
            .summaries(package.name(), &req, &mut *self.ops)?
            .map(|s| s.summary.clone())
            .next()
            .expect("summary not found");
        if let Some(cksum) = summary_with_cksum.checksum() {
            pkg.manifest_mut()
                .summary_mut()
                .set_checksum(cksum.to_string());
        }

        Ok(pkg)
    }
}

impl<'cfg> Source for RegistrySource<'cfg> {
    fn query(&mut self, dep: &Dependency, f: &mut dyn FnMut(Summary)) -> CargoResult<()> {
        // If this is a precise dependency, then it came from a lock file and in
        // theory the registry is known to contain this version. If, however, we
        // come back with no summaries, then our registry may need to be
        // updated, so we fall back to performing a lazy update.
        if dep.source_id().precise().is_some() && !self.updated {
            debug!("attempting query without update");
            let mut called = false;
            self.index
                .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, &mut |s| {
                    if dep.matches(&s) {
                        called = true;
                        f(s);
                    }
                })?;
            if called {
                return Ok(());
            } else {
                debug!("falling back to an update");
                self.do_update()?;
            }
        }

        self.index
            .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, &mut |s| {
                if dep.matches(&s) {
                    f(s);
                }
            })
    }
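
/*
The lazy-update fallback in `query` above can be illustrated with a toy model.
This is a sketch under stated assumptions, not a description of the real
types: `ToySource`, its fields, and the `locked` flag are hypothetical
stand-ins for the registry source, its index data, and a precise dependency.

```rust
/// Hypothetical stand-in for a registry source.
struct ToySource {
    /// Versions present in the local copy of the index.
    local: Vec<&'static str>,
    /// Versions an index update would bring in.
    remote: Vec<&'static str>,
    updated: bool,
}

impl ToySource {
    fn do_update(&mut self) {
        self.local = self.remote.clone();
        self.updated = true;
    }

    /// A `locked` lookup first tries the local data and only falls back to an
    /// update when nothing matched.
    fn query(&mut self, version: &str, locked: bool) -> bool {
        if locked && !self.updated {
            if self.local.iter().any(|v| *v == version) {
                return true; // found without touching the network
            }
            self.do_update(); // fall back to an update, then query again
        }
        self.local.iter().any(|v| *v == version)
    }
}
```

In this model, a locked lookup for a version that is already in `local`
returns without updating, while a locked lookup for a missing version triggers
exactly one update before the final query.
*/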

    fn fuzzy_query(&mut self, dep: &Dependency, f: &mut dyn FnMut(Summary)) -> CargoResult<()> {
        self.index
            .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, f)
    }

    fn supports_checksums(&self) -> bool {
        true
    }

    fn requires_precise(&self) -> bool {
        false
    }

    fn source_id(&self) -> SourceId {
        self.source_id
    }

    fn update(&mut self) -> CargoResult<()> {
        // If we have an imprecise version then we don't know what we're going
        // to look for, so we always attempt to perform an update here.
        //
        // If we have a precise version, then we'll update lazily during the
        // querying phase. Note that precise in this case is only
        // `Some("locked")` as other `Some` values indicate a `cargo update
        // --precise` request
        if self.source_id.precise() != Some("locked") {
            self.do_update()?;
        } else {
            debug!("skipping update due to locked registry");
        }
        Ok(())
    }

    fn download(&mut self, package: PackageId) -> CargoResult<MaybePackage> {
        let hash = self.index.hash(package, &mut *self.ops)?;
        match self.ops.download(package, hash)? {
            MaybeLock::Ready(file) => self.get_pkg(package, &file).map(MaybePackage::Ready),
            MaybeLock::Download { url, descriptor } => {
                Ok(MaybePackage::Download { url, descriptor })
            }
        }
    }

    fn finish_download(&mut self, package: PackageId, data: Vec<u8>) -> CargoResult<Package> {
        let hash = self.index.hash(package, &mut *self.ops)?;
        let file = self.ops.finish_download(package, hash, &data)?;
        self.get_pkg(package, &file)
    }

    fn fingerprint(&self, pkg: &Package) -> CargoResult<String> {
        Ok(pkg.package_id().version().to_string())
    }

    fn describe(&self) -> String {
        self.source_id.display_index()
    }

    fn add_to_yanked_whitelist(&mut self, pkgs: &[PackageId]) {
        self.yanked_whitelist.extend(pkgs);
    }

    fn is_yanked(&mut self, pkg: PackageId) -> CargoResult<bool> {
        if !self.updated {
            self.do_update()?;
        }
        self.index.is_yanked(pkg, &mut *self.ops)
    }
}