Quick Start

In this quick start we will create an index for searching WikiBooks. There are two essential parts: the Summa server, which indexes text data, and the Summa client, which communicates with the server.

Although Summa exposes a GRPC API that you can call with tools such as grpcurl, this guide uses the Summa client implemented in Python.

Install

Prerequisites:

Python3 or grpcurl
Docker

summa-server is distributed as a prebuilt Docker image hosted on Docker Hub, or it may be built from source. Summa exposes its APIs through GRPC, which makes it usable from any language that has GRPC client libraries. Additionally, the aiosumma Python package provides a Python client and a CLI.

Summa Server

We are going to pull and launch summa-server through Docker. The image is fetched with docker pull:

# Pull the current image for `summa-server`
docker pull izihawa/summa-server:testing

# Create local directory for storing index
mkdir data

# Generate config for `summa-server`
# -a flag is for setting listen address of GRPC API
docker run izihawa/summa-server:testing generate-config -d /data \
-a 0.0.0.0:8082 > summa.yaml

# Launch `summa-server`
docker run -v "$(pwd)/summa.yaml:/summa.yaml" -v "$(pwd)/data:/data" -p 8082:8082 \
izihawa/summa-server:testing serve /summa.yaml

After the last command you should see the startup logs of summa-server, similar to:

2022-11-17T16:14:00.712450Z  INFO main lifecycle: summa_server::servers::metrics: action="binded" endpoint="0.0.0.0:8084"
2022-11-17T16:14:00.714536Z  INFO main lifecycle: summa_server::servers::grpc: action="binded" endpoint="0.0.0.0:8082"
2022-11-17T16:14:00.752511Z  INFO main summa_server::services::index_service: action="index_holders" index_holders={}
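
When the server is started from a script instead of being watched by hand, a TCP probe can stand in for reading the log. The following is a minimal sketch, assuming the GRPC port 8082 is published on localhost as in the docker run command above. wait_for_port is a hypothetical helper and not part of Summa or aiosumma, and a successful connection only shows that the port is open, not that the service is fully initialised.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; it is not part of Summa or aiosumma.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect proves only that something listens on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("localhost", 8082) before issuing the first client command would block until the port accepts connections or 30 seconds have passed.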

Aiosumma

aiosumma is a Python package for using the Summa GRPC API from Python and from the terminal. Let’s install it:

# (Optional) Create virtual env for `aiosumma`
python3 -m venv venv
source venv/bin/activate

# Install aiosumma
pip3 install -U aiosumma

grpcurl

You may also use grpcurl, a curl-like tool, to reach summa-server from the terminal. You can download its binary from the grpcurl repository or install it through brew on macOS: brew install grpcurl

Create Index

Summa is a schemaful search engine. It requires you to define the fields that you are going to use. Let’s create a schema for WikiBooks:

# Create index schema in file
cat << EOF > schema.yaml
---
# yamllint disable rule:key-ordering
blocksize: 131072
compression: Zstd
index_name: books
index_attributes:
  conflict_strategy: OVERWRITE_ALWAYS
  description: Wiki
  multi_fields: ["category"]
index_engine:
  file: {}
schema: >
  - name: category
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: content_model
    type: text
    options:
      indexing:
        fieldnorms: true
        record: basic
        tokenizer: default
      stored: true
  - name: opening_text
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: auxiliary_text
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: language
    type: text
    options:
      indexing:
        fieldnorms: true
        record: basic
        tokenizer: default
      stored: true
  - name: title
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: text
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: timestamp
    type: date
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
  - name: create_timestamp
    type: date
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
  - name: popularity_score
    type: f64
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
  - name: incoming_links
    type: u64
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
  - name: namespace
    type: u64
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true

EOF

# Create index
summa-cli localhost:8082 - create-index-from-file schema.yaml
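
The schema above repeats the same options for every text field and for every fast field, so the field list can be generated instead of typed by hand. The following is a minimal sketch in Python that renders the same twelve field entries with the two-space indent used under the schema key. TEXT_FIELD, FAST_FIELD, and render_fields are hypothetical names introduced for illustration and are not part of Summa or aiosumma.

```python
import textwrap

# Template for text fields; only the name and the `record` option vary.
TEXT_FIELD = """\
- name: {name}
  type: text
  options:
    indexing:
      fieldnorms: true
      record: {record}
      tokenizer: default
    stored: true
"""

# Template for date and numeric fields; only the name and the type vary.
FAST_FIELD = """\
- name: {name}
  type: {type}
  options:
    fast: true
    fieldnorms: false
    indexed: true
    stored: true
"""

# (name, record) pairs copied from the schema above.
TEXT_FIELDS = [
    ("category", "position"), ("content_model", "basic"),
    ("opening_text", "position"), ("auxiliary_text", "position"),
    ("language", "basic"), ("title", "position"), ("text", "position"),
]

# (name, type) pairs copied from the schema above.
FAST_FIELDS = [
    ("timestamp", "date"), ("create_timestamp", "date"),
    ("popularity_score", "f64"), ("incoming_links", "u64"), ("namespace", "u64"),
]


def render_fields(text_fields, fast_fields):
    """Render all field entries, indented by two spaces for the `schema` block."""
    parts = [TEXT_FIELD.format(name=n, record=r) for n, r in text_fields]
    parts += [FAST_FIELD.format(name=n, type=t) for n, t in fast_fields]
    return textwrap.indent("".join(parts), "  ")


print(render_fields(TEXT_FIELDS, FAST_FIELDS))
```

Keeping the field names in two short lists makes it easier to see which fields store positions and which are fast fields.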

Add Documents

WikiBooks provides weekly dumps of its book database. Let’s download the current dump and index it in Summa:

# Download sample dataset
CURRENT_DUMP=$(curl -s -L "https://dumps.wikimedia.org/other/cirrussearch/current" | grep -oh '"enwikibooks[^"]*content\.json\.gz"' | tr -d '"')
wget "https://dumps.wikimedia.org/other/cirrussearch/current/$CURRENT_DUMP" -O enwikibooks.json.gz
gunzip enwikibooks.json.gz

# Upload half of the documents to Summa. You can upload the remaining half with `awk 'NR%4==2'`
# This will take a while, depending on the performance of your computer
awk 'NR%4==0' enwikibooks.json | summa-cli localhost:8082 - index-document-stream books

# Commit the index to make the documents searchable
summa-cli localhost:8082 - commit-index books
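
The awk filter above keeps every fourth line of the dump. The following is a minimal sketch of the same selection in Python, assuming the dump uses the Elasticsearch bulk layout in which each document line is preceded by a metadata line. Under that assumption, remainders 0 and 2 each pick one half of the documents and no metadata lines. select_lines is a hypothetical helper, and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator


def select_lines(lines: Iterable[str], modulus: int = 4, remainder: int = 0) -> Iterator[str]:
    """Yield lines whose 1-based number n satisfies n % modulus == remainder.

    This mirrors awk's `NR % modulus == remainder`.
    """
    for number, line in enumerate(lines, start=1):
        if number % modulus == remainder:
            yield line


# Made-up sample in the assumed layout: odd lines are metadata, even lines are documents.
sample = [
    '{"index": {"_id": "1"}}', '{"title": "Cookbook"}',
    '{"index": {"_id": "2"}}', '{"title": "Chess"}',
    '{"index": {"_id": "3"}}', '{"title": "Astronomy"}',
    '{"index": {"_id": "4"}}', '{"title": "Latin"}',
]

first_half = list(select_lines(sample, 4, 0))   # same as awk 'NR%4==0'
second_half = list(select_lines(sample, 4, 2))  # same as awk 'NR%4==2'
print(first_half)
print(second_half)
```

Applied to lines read from enwikibooks.json, the same generator could feed the indexing command in place of awk.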

We now have the WikiBooks database indexed locally.

Search

Let’s do a test query:

# Run a match query that returns the top 10 documents and the total number of matches
summa-cli localhost:8082 - search '{"index_alias": "books", "query": {"match": {"value": "astronomy"}}, "collectors": [{"top_docs": {"limit": 10}}, {"count": {}}]}'

You will see a response containing the documents found.
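
Hand-written JSON inside shell quotes is easy to break, especially when the search text contains quotes. The following is a minimal sketch that builds the same query structure in Python and prints it as a shell-safe argument. build_search_query is a hypothetical helper and not part of aiosumma; the field names are copied from the query above.

```python
import json
import shlex


def build_search_query(index_alias: str, text: str, limit: int = 10) -> str:
    """Return the JSON search request used above as a string.

    The structure is copied from the example query: a match query plus
    a top_docs collector and a count collector.
    """
    query = {
        "index_alias": index_alias,
        "query": {"match": {"value": text}},
        "collectors": [{"top_docs": {"limit": limit}}, {"count": {}}],
    }
    return json.dumps(query)


payload = build_search_query("books", "astronomy")
# shlex.quote wraps the JSON so it can be pasted as a single shell argument.
print(shlex.quote(payload))
```

The printed string can be passed as the last argument of the search command in place of the hand-written JSON.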