Commit e727081d authored by Tarje.Lavik

Init commit. Need to make some of the docker repositories to submodules

# These configs are used by the docker-compose file
LODSPK_BUILD_PATH=./docker-lodspeakr/.
LODSPK_SHARED_FOLDER=../lodspk-marcus
INDEXING_FILES_PATH=./elasticsearch-indexing-files
LOAD_FUSEKI_DATA_ON_START=yesPlease
*.zip filter=lfs diff=lfs merge=lfs -text
/.idea
*.xml
*owl
.DS_Store
*.iml
[submodule "elasticsearch-indexing-files"]
path = elasticsearch-indexing-files
url = git@bitbucket.org:ubbdst/elasticsearch-indexing-files.git
[submodule "onto-momayo"]
path = onto-momayo
url = git@git.app.uib.no:momayo/onto-momayo.git
[submodule "marcus-momayo"]
path = marcus-momayo
url = git@git.app.uib.no:momayo/marcus-momayo.git
# Momayo dockerized
This repository contains four Docker services: Docker Fuseki, Docker Elasticsearch, Docker Blackbox and Docker WebApp.
The containers need to be run in a composed environment to recreate the development environment we currently have in production.
At the core of the Momayo infrastructure is the master data file `data.rdf-xml.owl`. The path of this file is `*-momayo/data/data.rdf-xml.owl`, where "*" can be `marcus`, `riksantikvaren` or whatever name your dataset has. The infrastructure is meant to take RDF datasets that use CIDOC-CRM and the extensions found in the onto-momayo ontology.
We use the `marcus-momayo` repository as an example. It can be forked and renamed after your dataset or institution. `make-momayo` can also be forked and the `marcus-momayo` submodule replaced with your own `*-momayo` repository.
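For a dataset named `marcus`, the layout the services expect looks roughly like this (an illustrative sketch based on the paths described above; the repository name is assumed):
```
momayo-dockerized/            # this repository (name assumed)
├── docker-compose.yml
├── .env
├── onto-momayo/              # ontology submodule
└── marcus-momayo/
    └── data/
        └── data.rdf-xml.owl  # the master data file
```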
## Architecture
![Alt text](docker-update-marcus.png?raw=true "Class diagrams")
- These containers share the same network and can therefore communicate with one another by calling their service names, e.g. `http://fuseki:3030`.
- To run the containers, one must be in the directory where `docker-compose.yml` resides.
## Init submodules
We need to clone some submodules.
```bash
git submodule update --init --recursive
```
* `elasticsearch-indexing-files` is needed to index data.
* `onto-momayo` is the Protégé 3.5 project for our ontology.
* `marcus-momayo` is the Protégé 3.5 project that contains the following:
    * `onto-momayo`, the ontology also included one level up
    * `form-momayo`, with the form settings for Protégé
    * an empty `data` folder that takes a file called `data.rdf-xml.owl`, the master data file (see the example below)
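Before building, place your master data file in the empty `data` folder. A minimal sketch, assuming the `marcus-momayo` example and a hypothetical source path:
```bash
# Hypothetical source path: use your own export of the dataset
cp /path/to/your/data.rdf-xml.owl marcus-momayo/data/data.rdf-xml.owl
```
Then build and run the composed environment: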
```bash
# Build the environment
docker-compose build
# Run the environment
docker-compose up
# Destroy the environment
docker-compose down
```
## Running Docker Fuseki
All other Docker containers depend on Docker Fuseki. By default, Fuseki uploads data and creates a TDB that is then copied to the Elasticsearch Docker. If you do not want to load this data every time Docker starts, set the value of `LOAD_FUSEKI_DATA_ON_START` to empty in the `.env` file.
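For example, the relevant `.env` line would look like this (a sketch; compare the value `yesPlease` used in the provided `.env` to enable the load):
```bash
# .env: leave the value empty to skip loading data when Fuseki starts
LOAD_FUSEKI_DATA_ON_START=
```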
## The webapp
The webapp uses [LODspeakr](https://github.com/alangrafu/lodspeakr). The setup is at the moment somewhat cumbersome and needs to be improved.
LODspeakr uses SPARQL and queries the Fuseki container. Any frontend that can use SPARQL could replace LODspeakr; an Angular.io app with a custom Express API would make a much better frontend.
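As an illustration of what such a frontend would do, a SPARQL query can be sent directly to Fuseki (the dataset name `ds` below is an assumption and depends on the Fuseki configuration):
```bash
# Illustrative query; replace "ds" with the actual Fuseki dataset name
curl -G http://localhost:3030/ds/sparql \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10'
```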
## Access
If all is well, Blackbox will be available at `localhost:8080/blackbox`, Elasticsearch at `localhost:9200` and Fuseki at `localhost:3030`.
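A quick way to check that the services respond (Fuseki and Blackbox can also simply be opened in a browser):
```bash
curl http://localhost:9200              # Elasticsearch cluster info
curl -I http://localhost:8080/blackbox  # Blackbox, headers only
```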
## Data injection to Elasticsearch
To index data, you will have to run the scripts from [elasticsearch-indexing-files](https://bitbucket.org/ubbdst/elasticsearch-indexing-files).
Note that, since we are now using Admin data, you will have to run the `marcus-admin` scripts.
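Since `docker-compose.yml` bind-mounts `${INDEXING_FILES_PATH}` into the Elasticsearch container, the scripts can also be run from inside it; which script to run depends on the elasticsearch-indexing-files repository:
```bash
# Open a shell in the running es service; the indexing scripts are mounted here
docker-compose exec es bash
cd /usr/share/elasticsearch/indexingScripts/
```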
## Edit data with Protégé
UBB has updated the Protégé 3.5 version to "3.6" with some bug fixes and some custom functionality.
Download the Windows version here: https://github.com/ubbdst/protege-devel-3.5/releases
For Mac there are some plugins here (some manual copying is necessary): https://github.com/ubbdst/protege-owl-plugin/releases
Open `*-momayo/form-momayo/form-momayo.pprj` to edit.
FROM tomcat:8
ENV BLACKBOX_VERSION 0.66
ENV LOG_DIR="/var/log/blackbox"
#Copy blackbox war
COPY target/blackbox-${BLACKBOX_VERSION}.war /usr/local/tomcat/webapps/blackbox.war
# Create log directory with its parents
RUN mkdir -p ${LOG_DIR}
#Install vim to be able to view the logs
#RUN ["apt-get", "update"]
#RUN ["apt-get", "install", "-y", "vim"]
EXPOSE 9300
EXPOSE 8080
#Running docker
#docker build . --tag tomy
#docker run -it --rm -p 8080:8080 tomy
# Blackbox
A [Blackbox](https://bitbucket.org/ubbdst/blackbox) Docker image. The image is supposed to be run in the
composed environment with Docker Fuseki and Docker Elasticsearch in place.
The image depends on Docker Fuseki and Docker ES, hence those must exist before running this one.
# Run
To start a basic container, go to the directory where docker-compose.yml resides and run
```
docker-compose build
docker-compose up
```
## Access
If all is well, Blackbox will be available at `localhost:8080/blackbox`
#
# Dockerized Development Environment for running multi-container Docker applications
# for the University of Bergen Library.
#
# The variables are defined in the config-template.env file
#
version: '3.4'
#Defining services as multi-container Docker applications
services:
  fuseki:
    build: docker-fuseki/.
    networks:
      - search
    ports:
      - 3030:3030
    volumes:
      - tdb-admin:/data/tdb/admin
      - fuseki-databases:/fuseki/databases
      - type: bind
        source: ./out
        target: /staging/data/
      - type: bind
        source: ./onto-momayo/
        target: /staging/ontology/
    env_file:
      - .env
  es:
    build: docker-es/.
    networks:
      - search
    ports:
      - 9200:9200
      - 9300:9300
    volumes:
      - tdb-admin:/data/tdb/admin
      - es-data:/elasticsearch/data
      - type: bind
        source: ${INDEXING_FILES_PATH}
        target: /usr/share/elasticsearch/indexingScripts/
    depends_on:
      - fuseki
    env_file:
      - .env
  blackbox:
    build: docker-blackbox/.
    networks:
      - search
    ports:
      - 8080:8080
    depends_on:
      - fuseki
      - es
  webapp:
    build: ${LODSPK_BUILD_PATH}
    networks:
      - search
    ports:
      - 80:80
    volumes:
      - type: bind
        source: ${LODSPK_SHARED_FOLDER}
        target: /var/www/html/lodspeakr
    depends_on:
      - fuseki
      - es
      - blackbox
    env_file:
      - .env
volumes:
  tdb-admin:
  es-data:
  fuseki-databases:
networks:
  search:
#This Docker is specific to Elasticsearch 1.7
FROM java:8-jre
ENV ES_VERSION 1.7.6
ENV ES_CFG_FILE "/elasticsearch/config/elasticsearch.yml"
ENV ES_HEAP_SIZE 3g
# Install ElasticSearch.
RUN \
cd /tmp && \
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-${ES_VERSION}.tar.gz && \
tar xvzf elasticsearch-${ES_VERSION}.tar.gz && \
rm -f elasticsearch-${ES_VERSION}.tar.gz && \
mv /tmp/elasticsearch-${ES_VERSION} /elasticsearch
#Define mountable directories.
VOLUME ["/elasticsearch/data"]
#Define working directory.
WORKDIR /elasticsearch
EXPOSE 9200
EXPOSE 9300
#Install vim to be able to view logs
RUN ["apt-get", "update"]
RUN ["apt-get", "install", "-y", "vim"]
#Install UBB River plugin
RUN /elasticsearch/bin/plugin --url https://github.com/ubbdst/elasticsearch-rdf-river/releases/download/1.7.6/ubb-rdf-river-plugin-1.7.6.zip --install ubb-river-1.7.6
#Copy cluster config to docker
COPY elasticsearch.yml /elasticsearch/config/
#Copy Norwegian stopwords
COPY stopwords /elasticsearch/config/stopwords
ADD run /usr/local/bin/run
RUN chmod +x /usr/local/bin/run
#CMD run
CMD ./bin/elasticsearch -Des.config=${ES_CFG_FILE}
#Building the Elasticsearch docker image
#docker
#docker run --volumes-from fuseki --volume /Users/hru066/Desktop/es-data:/elasticsearch/data --name elasticsearch -p 9200:9200 -p 9300:9300 es
#docker run --volumes-from fuseki --name elasticsearch -p 9200:9200 -p 9300:9300 es
# Elasticsearch
A [1.7.6](https://www.elastic.co/downloads/past-releases/elasticsearch-1-7-6) Elasticsearch Docker image. The image is supposed to be run in the
composed environment with Docker Fuseki and Docker Blackbox.
The image depends on Docker Fuseki, hence it must exist before running this one.
# Quick Start
To start a basic container, build the image and publish ports 9200 and 9300:
```
docker build . --tag es
docker run --volumes-from fuseki --name elasticsearch -p 9200:9200 -p 9300:9300 es
```
# Persistence
With the container, the volume `/elasticsearch/data` can be persisted to the host machine.
To start a default container with attached persistent/shared storage for data:
```sh
docker run --volumes-from fuseki --volume /es-data:/elasticsearch/data --name elasticsearch -p 9200:9200 -p 9300:9300 es
```
##################### Elasticsearch Configuration Example #####################
# This file contains an overview of various configuration settings,
# targeted at operations staff. Application developers should
# consult the guide at <http://elasticsearch.org/guide>.
#
# The installation procedure is covered at
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/setup.html>.
#
# Elasticsearch comes with reasonable defaults for most settings,
# so you can try it out without bothering with configuration.
#
# Most of the time, these defaults are just fine for running a production
# cluster. If you're fine-tuning your cluster, or wondering about the
# effect of certain configuration option, please _do ask_ on the
# mailing list or IRC channel [http://elasticsearch.org/community].
# Any element in the configuration can be replaced with environment variables
# by placing them in ${...} notation. For example:
#
#node.rack: ${RACK_ENV_VAR}
# For information on supported formats and syntax for the config file, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html>
################################### Cluster ###################################
# Cluster name identifies your cluster for auto-discovery. If you're running
# multiple clusters on the same network, make sure you're using unique names.
#
cluster.name: ubb-elasticsearch-docker
#################################### Node #####################################
# Node names are generated dynamically on startup, so you're relieved
# from configuring them manually. You can tie this node to a specific name:
#
node.name: "UBB Docker Node"
# Every node can be configured to allow or deny being eligible as the master,
# and to allow or deny to store the data.
#
# Allow this node to be eligible as a master node (enabled by default):
#
node.master: true
#
# Allow this node to store data (enabled by default):
#
node.data: true
# You can exploit these settings to design advanced cluster topologies.
#
# 1. You want this node to never become a master node, only to hold data.
# This will be the "workhorse" of your cluster.
#
#node.master: false
#node.data: true
#
# 2. You want this node to only serve as a master: to not store any data and
# to have free resources. This will be the "coordinator" of your cluster.
#
#node.master: true
#node.data: false
#
# 3. You want this node to be neither master nor data node, but
# to act as a "search load balancer" (fetching data from nodes,
# aggregating results, etc.)
#
#node.master: false
#node.data: false
# Use the Cluster Health API [http://localhost:9200/_cluster/health], the
# Node Info API [http://localhost:9200/_nodes] or GUI tools
# such as <http://www.elasticsearch.org/overview/marvel/>,
# <http://github.com/karmi/elasticsearch-paramedic>,
# <http://github.com/lukas-vlcek/bigdesk> and
# <http://mobz.github.com/elasticsearch-head> to inspect the cluster state.
# A node can have generic attributes associated with it, which can later be used
# for customized shard allocation filtering, or allocation awareness. An attribute
# is a simple key value pair, similar to node.key: value, here is an example:
#
#node.rack: rack314
# By default, multiple nodes are allowed to start from the same installation location
# to disable it, set the following:
#node.max_local_storage_nodes: 1
#################################### Index ####################################
# You can set a number of options (such as shard/replica options, mapping
# or analyzer definitions, translog settings, ...) for indices globally,
# in this file.
#
# Note, that it makes more sense to configure index settings specifically for
# a certain index, either when creating it or by using the index templates API.
#
# See <http://elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html> and
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html>
# for more information.
# Set the number of shards (splits) of an index (5 by default):
#
#index.number_of_shards: 5
# Set the number of replicas (additional copies) of an index (1 by default):
#
#index.number_of_replicas: 1
# Note, that for development on a local machine, with small indices, it usually
# makes sense to "disable" the distributed features:
#
index.number_of_shards: 3
index.number_of_replicas: 0
# These settings directly affect the performance of index and search operations
# in your cluster. Assuming you have enough machines to hold shards and
# replicas, the rule of thumb is:
#
# 1. Having more *shards* enhances the _indexing_ performance and allows to
# _distribute_ a big index across machines.
# 2. Having more *replicas* enhances the _search_ performance and improves the
# cluster _availability_.
#
# The "number_of_shards" is a one-time setting for an index.
#
# The "number_of_replicas" can be increased or decreased anytime,
# by using the Index Update Settings API.
#
# Elasticsearch takes care about load balancing, relocating, gathering the
# results from nodes, etc. Experiment with different settings to fine-tune
# your setup.
# Use the Index Status API (<http://localhost:9200/A/_status>) to inspect
# the index status.
#################################### Paths ####################################
# Path to directory containing configuration (this file and logging.yml):
#
#path.conf: /path/to/conf
# Path to directory where to store index data allocated for this node.
#
#path.data: /path/to/data
#
# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation. For example:
#
#path.data: /path/to/data1,/path/to/data2
# Path to temporary files:
#
#path.work: /path/to/work
# Path to log files:
#
#path.logs: /path/to/logs
# Path to where plugins are installed:
#
#path.plugins: /path/to/plugins
#################################### Plugin ###################################
# If a plugin listed here is not installed for current node, the node will not start.
#
#plugin.mandatory: mapper-attachments,lang-groovy
################################### Memory ####################################
# Elasticsearch performs poorly when JVM starts swapping: you should ensure that
# it _never_ swaps.
#
# Set this property to true to lock the memory:
#
bootstrap.mlockall: true
# Make sure that the ES_MIN_MEM and ES_MAX_MEM environment variables are set
# to the same value, and that the machine has enough memory to allocate
# for Elasticsearch, leaving enough memory for the operating system itself.
#
# You should also make sure that the Elasticsearch process is allowed to lock
# the memory, eg. by using `ulimit -l unlimited`.
############################## Network And HTTP ###############################
# Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
# on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
# communication. (the range means that if the port is busy, it will automatically
# try the next port).
# Set the bind address specifically (IPv4 or IPv6):
#
#network.bind_host: 192.168.0.1
# Set the address other nodes will use to communicate with this node. If not
# set, it is automatically derived. It must point to an actual IP address.
#
#network.publish_host: 192.168.0.1
# Set both 'bind_host' and 'publish_host':
#
#network.host: 192.168.0.1
# Set a custom port for the node to node communication (9300 by default):
#
#transport.tcp.port: 9300
# Enable compression for all communication between nodes (disabled by default):
#
#transport.tcp.compress: true
# Set a custom port to listen for HTTP traffic:
#
#http.port: 9200
# Set a custom allowed content length:
#
#http.max_content_length: 100mb
# Disable HTTP completely:
#
#http.enabled: false
################################### Gateway ###################################
# The gateway allows for persisting the cluster state between full cluster
# restarts. Every change to the state (such as adding an index) will be stored
# in the gateway, and when the cluster starts up for the first time,
# it will read its state from the gateway.
# There are several types of gateway implementations. For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html>.
# The default gateway type is the "local" gateway (recommended):
#
#gateway.type: local
# Settings below control how and when to start the initial recovery process on
# a full cluster restart (to reuse as much local data as possible when using shared
# gateway).
# Allow recovery process after N nodes in a cluster are up:
#
#gateway.recover_after_nodes: 1
# Set the timeout to initiate the recovery process, once the N nodes
# from previous setting are up (accepts time value):
#
#gateway.recover_after_time: 5m
# Set how many nodes are expected in this cluster. Once these N nodes
# are up (and recover_after_nodes is met), begin recovery process immediately
# (without waiting for recover_after_time to expire):
#
#gateway.expected_nodes: 2
############################# Recovery Throttling #############################
# These settings allow to control the process of shards allocation between
# nodes during initial recovery, replica allocation, rebalancing,
# or when adding and removing nodes.
# Set the number of concurrent recoveries happening on a node:
#
# 1. During the initial recovery
#
#cluster.routing.allocation.node_initial_primaries_recoveries: 4
#
# 2. During adding/removing nodes, rebalancing, etc
#
#cluster.routing.allocation.node_concurrent_recoveries: 2
# Set to throttle throughput when recovering (eg. 100mb, by default 20mb):
#
#indices.recovery.max_bytes_per_sec: 20mb
# Set to limit the number of open concurrent streams when
# recovering a shard from a peer:
#
#indices.recovery.concurrent_streams: 5
################################## Discovery ##################################
# Discovery infrastructure ensures nodes can be found within a cluster
# and master node is elected. Multicast discovery is the default.
# Set to ensure a node sees N other master eligible nodes to be considered
# operational within the cluster. This should be set to a quorum/majority of
# the master-eligible nodes in the cluster.
#
#discovery.zen.minimum_master_nodes: 1
# Set the time to wait for ping responses from other nodes when discovering.
# Set this option to a higher value on a slow or congested network
# to minimize discovery failures:
#
#discovery.zen.ping.timeout: 3s
# For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html>
# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
#
# 1. Disable multicast discovery (enabled by default):
#
#discovery.zen.ping.multicast.enabled: false
#
# 2. Configure an initial list of master nodes in the cluster
# to perform discovery when new nodes (master or data) are started:
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]
# EC2 discovery allows to use AWS EC2 API in order to perform discovery.
#
# You have to install the cloud-aws plugin for enabling the EC2 discovery.