In this quick-start we will create an index for searching WikiBooks. There are two essential parts: the Summa server, which is responsible for indexing text data, and the Summa client, which is required for communicating with the Summa server.
Although there is a GRPC API that you can use through tools like grpcurl, here we will use the Summa client implemented in Python.
Install
Prerequisite:
summa-server is distributed as a prebuilt Docker image hosted on Docker Hub, or it may be built from source. Summa exposes its APIs through GRPC, which makes it usable from any language that has GRPC client libraries. Additionally, there is the aiosumma Python package, which provides a Python client and a CLI.
Summa Server
We are going to pull and launch summa-server through Docker. Pulling can be done with docker pull:
# Pull the current image for `summa-server`
docker pull izihawa/summa-server:testing
# Create a local directory for storing the index
mkdir data
# Generate config for `summa-server`
# -a flag is for setting listen address of GRPC API
docker run izihawa/summa-server:testing generate-config -d /data \
-a 0.0.0.0:8082 > summa.yaml
# Launch `summa-server`
docker run -v "$(pwd)/summa.yaml:/summa.yaml" -v "$(pwd)/data:/data" -p 8082:8082 \
izihawa/summa-server:testing serve /summa.yaml
After the last command you should see the startup logs of summa-server, something like:
2022-11-17T16:14:00.712450Z INFO main lifecycle: summa_server::servers::metrics: action="binded" endpoint="0.0.0.0:8084"
2022-11-17T16:14:00.714536Z INFO main lifecycle: summa_server::servers::grpc: action="binded" endpoint="0.0.0.0:8082"
2022-11-17T16:14:00.752511Z INFO main summa_server::services::index_service: action="index_holders" index_holders={}
Aiosumma
aiosumma is a Python package for using the Summa GRPC API from Python and the terminal. Let’s install it:
# (Optional) Create virtual env for `aiosumma`
python3 -m venv venv
source venv/bin/activate
# Install aiosumma
pip3 install -U aiosumma
grpcurl
You may also use grpcurl, a curl-like tool, to reach summa-server through the terminal. You can download a binary from its repository or install it through brew on macOS: brew install grpcurl
Create Index
Summa is a schemaful search engine: you must define the fields you are going to use. Let’s create a schema for WikiBooks:
# Create index schema in file
cat << EOF > schema.yaml
---
# yamllint disable rule:key-ordering
blocksize: 131072
compression: Zstd
index_name: books
index_attributes:
  conflict_strategy: OVERWRITE_ALWAYS
  description: Wiki
  multi_fields: ["category"]
index_engine:
  file: {}
schema: >
  - name: category
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: content_model
    type: text
    options:
      indexing:
        fieldnorms: true
        record: basic
        tokenizer: default
      stored: true
  - name: opening_text
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: auxiliary_text
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: language
    type: text
    options:
      indexing:
        fieldnorms: true
        record: basic
        tokenizer: default
      stored: true
  - name: title
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: text
    type: text
    options:
      indexing:
        fieldnorms: true
        record: position
        tokenizer: default
      stored: true
  - name: timestamp
    type: date
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
  - name: create_timestamp
    type: date
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
  - name: popularity_score
    type: f64
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
  - name: incoming_links
    type: u64
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
  - name: namespace
    type: u64
    options:
      fast: true
      fieldnorms: false
      indexed: true
      stored: true
EOF
# Create index
summa-cli localhost:8082 - create-index-from-file schema.yaml
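Before creating the index, it can help to check which fields the schema file actually declares. The following is a minimal sketch, not part of Summa: `list_schema_fields` is a hypothetical helper that only pattern-matches the `- name:` and `type:` lines, so it is a quick sanity check rather than a YAML validator.

```shell
# Hypothetical helper: print "<field name> <type>" for every field declared in a schema file.
# It matches "- name:" and "type:" lines textually and does not parse YAML.
list_schema_fields() {
  awk '$1 == "-" && $2 == "name:" { name = $3 }
       $1 == "type:"              { print name, $2 }' "$1"
}

# Run against the schema created above, if it exists in the current directory
if [ -f schema.yaml ]; then
  list_schema_fields schema.yaml
fi
```

For the schema above this should print one line per field, for example `category text` and `timestamp date`.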
Add Documents
WikiBooks provides weekly dumps of its book database. Let’s download the latest dump and index it in Summa:
# Download sample dataset
CURRENT_DUMP=$(curl -s -L "https://dumps.wikimedia.org/other/cirrussearch/current" | grep -oh '"enwikibooks[^"]*content\.json\.gz"' | tr -d '"')
wget "https://dumps.wikimedia.org/other/cirrussearch/current/$CURRENT_DUMP" -O enwikibooks.json.gz
gunzip enwikibooks.json.gz
# Upload half of the documents to Summa (documents sit on the even lines of the dump, after their metadata lines).
# You can upload the remaining half by using `awk 'NR%4==2'` instead.
# It will take a while, depending on the performance of your computer
awk 'NR%4==0' enwikibooks.json | summa-cli localhost:8082 - index-document-stream books
# Commit the index to make the documents searchable
summa-cli localhost:8082 - commit-index books
We now have the WikiBooks database indexed locally.
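To see why `NR%4==0` and `NR%4==2` each select half of the documents, here is a small sketch on made-up data. It assumes the dump follows the bulk layout in which every document line is preceded by a metadata line; the sample contents are invented for illustration only.

```shell
# Build a tiny stand-in for the dump: odd lines are metadata, even lines are documents.
sample=$(mktemp)
cat > "$sample" << 'EOF'
{"index": {"_id": "1"}}
{"title": "A"}
{"index": {"_id": "2"}}
{"title": "B"}
{"index": {"_id": "3"}}
{"title": "C"}
{"index": {"_id": "4"}}
{"title": "D"}
EOF

# Lines 4, 8, ...: every second document
first_half=$(awk 'NR%4==0' "$sample")
# Lines 2, 6, ...: the remaining documents
second_half=$(awk 'NR%4==2' "$sample")

echo "$first_half"
echo "$second_half"
rm -f "$sample"
```

The first filter keeps documents B and D, the second keeps A and C, and neither ever selects a metadata line.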
Search
Let’s do a test query:
# Run a match query that returns the top 10 documents and the total number of matches
summa-cli localhost:8082 search '[{"index_alias": "books", "query": {"match": {"value": "astronomy"}}, "collectors": [{"top_docs": {"limit": 10}}, {"count": {}}]}]'
You will see a response containing the matching documents.
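When trying several queries, typing the JSON request by hand gets tedious. The sketch below wraps the request from the example in a hypothetical shell helper, `build_match_query`, which is not part of summa-cli. It assumes the search term contains no double quotes or backslashes, since the term is inserted into the JSON verbatim.

```shell
# Hypothetical helper: build the match query shown above for a given term and result limit.
# $1 is the search term (inserted verbatim), $2 is the top_docs limit.
build_match_query() {
  printf '[{"index_alias": "books", "query": {"match": {"value": "%s"}}, "collectors": [{"top_docs": {"limit": %d}}, {"count": {}}]}]' "$1" "$2"
}

QUERY=$(build_match_query "astronomy" 10)
echo "$QUERY"

# With the server running, the request can be sent the same way as in the example:
# summa-cli localhost:8082 search "$QUERY"
```

Only the term and the limit vary here; the index alias and the two collectors are kept exactly as in the test query.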