Single machine deployment tutorial
The repository includes a "Single Machine" Docker Compose configuration which brings up the FHIR Pipelines Controller plus a Spark Thrift server, letting you more easily run Spark SQL queries on the Parquet files output by the Pipelines Controller.
To learn how the Pipelines Controller works on its own, try out the FHIR Pipelines Controller.
Requirements
- A source HAPI FHIR server configured to use Postgres as its database
  - If you don't have a server, use a local test server by following the instructions to bring up a source HAPI FHIR server with Postgres
- Docker
  - If you are using Linux, Docker must be in sudoless mode
- Docker Compose - this guide assumes you are using the latest version
- The FHIR Data Pipes repository, cloned onto the host machine
Configure the FHIR Pipelines Controller
All file paths are relative to the root directory of the FHIR Data Pipes repository.
NOTE: You need to configure only one of the following options:
- Open docker/config/application.yaml and edit the fhirServerUrl parameter to match the FHIR server you are connecting to, then comment out the dbConfig parameter (see the sketch after this list).
- Alternatively, open docker/config/application.yaml, comment out the fhirServerUrl parameter, and set dbConfig to the DB connection config file, e.g. docker/config/hapi-postgres-config_local.json. Edit the JSON values in that file to match the database of the FHIR server you are connecting to.
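For orientation, here is a rough sketch of how these two parameters might appear in docker/config/application.yaml; the fhirdata: top-level key and the example values are assumptions, so check the actual file in your clone for the exact structure and paths.

```yaml
fhirdata:
  # Option 1: fetch resources through the FHIR server's API (comment out dbConfig below).
  fhirServerUrl: "http://<your-fhir-server>/fhir"
  # Option 2: read directly from the FHIR server's Postgres database (comment out fhirServerUrl above).
  # dbConfig: "docker/config/hapi-postgres-config_local.json"
```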
Parquet Files and Flattened Views
FHIR Data Pipes supports the option of transforming FHIR data into an RDBMS (as the data warehouse), using FHIR ViewDefinition resources to define the custom schema. In this tutorial the option is PostgreSQL-specific, but the feature is compatible with any RDBMS.
The concept of "flattened views" comes from applying ViewDefinitions to FHIR Resources to get flat tables. Please check Schema and Flat Views to learn more.
With the default config, you will create both Parquet files (under dwhRootPrefix) and flattened views in the relational database configured by sinkDbConfigPath (the sinkDbConfigPath parameter refers to the target DB that will become the data warehouse).
If you do not want flat tables in a sink DB, comment out that parameter!
If you do need them, make sure you first create the database referenced in the sink connection config file by running a CREATE DATABASE query in Postgres.
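For example, assuming the sink config points at a database named views (the name used when exploring the results later in this tutorial), the setup query would be something like:

```sql
-- Adjust the name to match the database in your sink connection config file.
CREATE DATABASE views;
```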
For documentation of all config parameters, see here.
Note: If you are using the local test servers, things should work with the default values. If not, use the IP address of the Docker default bridge network. To find it, run the following command and use the "Gateway" value:
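For example (bridge is the name of Docker's default bridge network):

```shell
# Look for the "Gateway" value under the IPAM config section of the output.
docker network inspect bridge
```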
The Single Machine Docker configuration uses two environment variables, DWH_ROOT and PIPELINE_CONFIG, whose default values are defined in the .env file.
To override them, set the desired variable before running the docker compose command. For example, to override the DWH_ROOT environment variable, run the following:
DWH_ROOT="$(pwd)/<path_to_dwh_directory>" docker compose -f docker/compose-controller-spark-sql-single.yaml up --force-recreate
Run the Single Machine Configuration
To bring up the docker/compose-controller-spark-sql-single.yaml configuration for the first time, or if you have run this container in the past and want to include new changes pulled into the repo, run:
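A likely form of this command, based on the compose file shown in the override example above; the --build flag is added here so that Docker rebuilds the images and picks up new changes:

```shell
docker compose -f docker/compose-controller-spark-sql-single.yaml up --force-recreate --build
```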
Alternatively, to run without rebuilding, use:
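Under the same assumptions, drop the --build flag:

```shell
docker compose -f docker/compose-controller-spark-sql-single.yaml up --force-recreate
```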
Alternatively, docker/compose-controller-spark-sql.yaml serves as a very simple example of how to integrate the Parquet output of the pipelines into a Spark cluster environment.
Once started, the Pipelines Controller is available at http://localhost:8090 and the Spark Thrift server at http://localhost:10001.
The first time you run the Pipelines Controller, you are required to start a Full Pipeline run. In a browser, go to http://localhost:8090 and click the Run Full button.
After running the Full Pipeline, use the Incremental Pipeline to update the Parquet files and tables. By default, it is scheduled to run every hour, or you can manually trigger it.
If the Incremental Pipeline does not work, or you see errors like:
ERROR o.openmrs.analytics.PipelineManager o.openmrs.analytics.PipelineManager$PipelineThread.run:343 - exception while running pipeline:
pipeline-controller | java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
Try running sudo chmod -R 755 on the Parquet file directory, which is located at docker/dwh by default.
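With the default location, that is:

```shell
sudo chmod -R 755 docker/dwh
```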
Explore the resulting schema in PostgreSQL
If you set the sinkDbConfigPath parameter to output views to a PostgreSQL DWH, connect to the DB container via Docker with: docker exec -it <container_name_or_id> bash
If using the default container (hapi-fhir-db), run: docker exec -it hapi_fhir_db bash
Using psql, connect to the 'views' database: psql -U admin -d views
To list the tables, run \d. It should look something like this:
| Schema | Name | Type | Owner |
|---|---|---|---|
| public | condition_flat | table | admin |
| public | diagnostic_report_flat | table | admin |
| public | immunization_flat | table | admin |
| public | location_flat | table | admin |
| public | medication_request_flat | table | admin |
| public | observation_flat | table | admin |
| public | organization_flat | table | admin |
| public | practitioner_flat | table | admin |
| public | practitioner_role_flat | table | admin |
| public | procedure_flat | table | admin |
View and Analyze the Data
Let's do some basic quality checks to make sure the data is uploaded properly (note table names are case-insensitive).
Connect to the Spark Thrift server using a client that supports Apache Hive. For example, if using the JDBC driver, the URL should be jdbc:hive2://localhost:10001.
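If you have the Apache Hive beeline client available, a quick way to connect from the command line is (this assumes no authentication is required, which is typically the case for this local setup):

```shell
beeline -u jdbc:hive2://localhost:10001
```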
The pipeline will automatically create Patient, Encounter, and Observation tables when run. As a first check, count the rows in the patient table, then do the same for observations:
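A minimal version of these checks, assuming the default table names (as noted above, table names are case-insensitive):

```sql
-- Row counts on the Parquet-backed tables exposed through the Spark Thrift server.
SELECT COUNT(*) FROM patient;
SELECT COUNT(*) FROM observation;
```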
Connect to your sink DB with the database client of your choice.
Note: You will see that the number of patients and observations is higher than the count in the Thrift Server. This is due to the flattening.
We should have exactly 114 patients. Let's do the same check for observations:
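A likely form of these checks, assuming the flat table names listed earlier (adjust if your ViewDefinitions produce different tables):

```sql
-- Row counts on the flattened sink tables; flattening can yield multiple rows per FHIR resource.
SELECT COUNT(*) FROM patient_flat;
SELECT COUNT(*) FROM observation_flat;
```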
What's Next
Now that the data is available in an SQL-queryable format, you can start to explore it using SQL or Jupyter notebooks, or build dashboards with common open source tools like Apache Superset.