The import process

Importing data into Solr is relatively straightforward. Each entity type that can be imported has a SearchEntity object, which keeps track of that type’s indexable fields and its model in mbdata.

Once it’s known which entity types will be imported, sir.indexing._multiprocessed_import() will successively spawn multiprocessing.Process instances via multiprocessing.pool. Each of these processes retrieves one batch of entities from the database via a query built by build_entity_query() and converts them into regular dicts via query_result_to_dict(). The result of the conversion is put into a multiprocessing.Queue. On the other end of the queue, another process running sir.indexing.queue_to_solr() sends the dicts to Solr in batches.

digraph indexing {
graph [rankdir=TB]

subgraph cluster_processes {
    graph [rankdir=LR]
    p_n [label="Process n"]
    p_dot [label="Process ..."]
    p_2 [label="Process #2"]
    p_1 [label="Process #1"]
    color = lightgrey
}

mb [label="MusicBrainz DB"]
push_proc [label="Push process"]
queue [label="Data queue" shape=diamond]
solr [label="Solr server"]

mb -> p_1;
mb -> p_2;
mb -> p_dot;
mb -> p_n;
p_n -> queue;
p_dot -> queue;
p_1 -> queue;
p_2 -> queue;
queue -> push_proc;
push_proc -> solr;
}
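The push side of the diagram can be sketched as a loop that drains the queue and groups items into batches. This is only a sketch under assumptions, not sir’s actual code: the STOP sentinel, batches_from_queue() and push_all() are hypothetical names, and queue.Queue stands in for multiprocessing.Queue (both offer put() and get()) so that the example runs in a single process.

```python
import queue

STOP = None  # hypothetical sentinel that marks the end of the data


def batches_from_queue(q, batch_size):
    """Yield lists of up to batch_size items read from q until STOP is seen."""
    batch = []
    while True:
        item = q.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    # Flush whatever is left over as a final, smaller batch.
    if batch:
        yield batch


def push_all(q, send, batch_size=2):
    """Pass every batch to send() and return the number of items sent."""
    sent = 0
    for batch in batches_from_queue(q, batch_size):
        send(batch)  # in a real setup this would talk to Solr
        sent += len(batch)
    return sent


# The producers would normally be separate processes; here the queue is
# simply filled up front.
data_queue = queue.Queue()
for i in range(5):
    data_queue.put({"id": i})
data_queue.put(STOP)

received = []
total = push_all(data_queue, received.append, batch_size=2)
```

Sending in batches keeps the number of requests to the server low, while the sentinel lets the push process finish once the producers are done.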

Paths

Each SearchEntity is assigned a declarative class via its model attribute and a collection of SearchField objects, each corresponding to a field in the entity’s Solr core. Each of those fields has one or more paths that “lead” to the values that will be put into the field in Solr. iterate_path_values() returns an iterator over all values for a specific path from an instance of a declarative class. Its docstring describes how that works, so here’s a verbatim copy of it:

querying.iterate_path_values(path, obj)

Return an iterator over all values for path on obj, an instance of a declarative class by first splitting the path into its elements by splitting it on the dots, resulting in a list of path elements. Then, for each element, a call to getattr() is made - the arguments will be the current model (which initially is the model assigned to the SearchEntity) and the current path element. After doing this, there are two cases:

  1. The path element is not the last one in the path. In this case, the getattr() call returns one or more objects of another model which will replace the current one.
  2. The path element is the last one in the path. In this case, the value returned by the getattr() call will be returned and added to the list of values for this field.

To give an example, let’s presume the object we’re starting with is an instance of Artist and the path is “begin_area.name”. The first getattr() call will be:

getattr(obj, "begin_area")

which returns an Area object, on which the call:

getattr(obj, "name")

will return the final value:

>>> from mbdata.models import Artist, Area
>>> artist = Artist(name="Johan de Meij")
>>> area = Area(name="Netherlands")
>>> artist.begin_area = area
>>> list(iterate_path_values("begin_area.name", artist))
['Netherlands']

One-to-many relationships will of course be handled as well:

>>> from mbdata.models import Recording, ISRC
>>> recording = Recording(name="Fortuna Imperatrix Mundi: O Fortuna")
>>> recording.isrcs.append(ISRC(isrc="DEF056730100"))
>>> recording.isrcs.append(ISRC(isrc="DEF056730101"))
>>> list(iterate_path_values("isrcs.isrc", recording))
['DEF056730100', 'DEF056730101']
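The traversal described in the docstring can also be tried out without mbdata or a database. The following is a minimal sketch, not the implementation in sir: walk_path() is a hypothetical name, plain SimpleNamespace objects stand in for declarative instances, and skipping None values is an assumption made for the sketch.

```python
from types import SimpleNamespace


def walk_path(path, obj):
    """Return an iterator over all values reachable from obj via the dotted path."""

    def _walk(current, elements):
        if current is None:
            # Assumption: an unset relationship contributes no values.
            return
        if isinstance(current, (list, tuple)):
            # One-to-many relationship: follow the rest of the path per item.
            for item in current:
                yield from _walk(item, elements)
            return
        if not elements:
            # The whole path has been consumed, so this is a final value.
            yield current
            return
        yield from _walk(getattr(current, elements[0]), elements[1:])

    return _walk(obj, path.split("."))


# Stand-ins for an Artist with a begin area and a Recording with two ISRCs.
artist = SimpleNamespace(
    name="Johan de Meij",
    begin_area=SimpleNamespace(name="Netherlands"),
)
recording = SimpleNamespace(
    isrcs=[
        SimpleNamespace(isrc="DEF056730100"),
        SimpleNamespace(isrc="DEF056730101"),
    ],
)

area_names = list(walk_path("begin_area.name", artist))
isrc_values = list(walk_path("isrcs.isrc", recording))
```

Here area_names holds the single area name and isrc_values holds both ISRCs, mirroring the two doctest examples above.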

sir.schema.SCHEMA is a dictionary mapping core names to SearchEntity objects.
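To make the shape of that mapping more concrete, here is a small sketch built from hypothetical stand-ins. FieldSketch, EntitySketch, SCHEMA_SKETCH, all_paths() and the field names are invented for illustration and are not sir’s SearchField, SearchEntity or SCHEMA; they only mirror the structure described in this section (a model plus fields, each with one or more paths).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldSketch:
    """Hypothetical stand-in for a search field: a name and its paths."""
    name: str
    paths: List[str]


@dataclass
class EntitySketch:
    """Hypothetical stand-in for a search entity: a model and its fields."""
    model: type
    fields: List[FieldSketch] = field(default_factory=list)

    def all_paths(self):
        """Return every distinct path used by any field, in first-seen order."""
        seen = []
        for search_field in self.fields:
            for path in search_field.paths:
                if path not in seen:
                    seen.append(path)
        return seen


class Artist:
    """Placeholder for a declarative model class."""


# Maps a core name to the entity description for that core.
SCHEMA_SKETCH = {
    "artist": EntitySketch(
        model=Artist,
        fields=[
            FieldSketch("artist", ["name"]),
            FieldSketch("beginarea", ["begin_area.name"]),
            FieldSketch("alias", ["aliases.name"]),
        ],
    ),
}

artist_paths = SCHEMA_SKETCH["artist"].all_paths()
```

Looking up a core name yields everything needed to index that entity type: the model to query and the paths to follow for each field.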