JPA Native Query Postgres Full Text Search#
What Is The Full Text Search?#
-
Full Text Searching
(or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document. -
Textual search operators have existed in databases for years. PostgreSQL has
~
,~*
,LIKE
, andILIKE
operators for textual data types, but they lack many essential properties required by modern information systems:-
There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words, e.g.,
satisfies
andsatisfy
. You might miss documents that containsatisfies
, although you probably would like to find them when searching forsatisfy
. It is possible to useOR
to search for multiple derived forms, but this is tedious and error-prone (some words can have several thousand derivatives). -
They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
-
They tend to be slow because there is no index support, so they must process all documents for every search.
-
-
Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
-
Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently. In principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. PostgreSQL uses a parser to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
-
Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
-
Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for proximity ranking, so that a document that contains a more “dense” region of query words is assigned a higher rank than one with scattered query words.
-
-
Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:
- Define stop words that should not be indexed.
- Map synonyms to a single word using Ispell.
- Map phrases to a single word using a thesaurus.
- Map different variations of a word to a canonical form using an Ispell dictionary.
- Map different variations of a word to a canonical form using Snowball stemmer rules.
-
A data type
tsvector
is provided for storing preprocessed documents, along with a typetsquery
for representing processed queries. There are many functions and operators available for these data types, the most important of which is the match operator@@,
which we introduce in.Full text searches
can be accelerated using indexes.
What Is A Document?#
-
A document
is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words. -
For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole.
-
For example: we have a table
customers
which contains some column as below.
id | address | dob | full_name | gender | phone | |
---|---|---|---|---|---|---|
b94d2cb0-2b55-4191-b946-12dbd117cfa2 | 893 Bogisich Row | 1995-10-10 07:00:00.000 | Maximilian.Renner68@example.net | Abel Rath IV | M | 377-802-6499 |
305920c6-102b-4592-a33d-15657d7ee515 | 57786 Selmer Squares | 1995-10-10 07:00:00.000 | Henri.Schowalter8@example.net | Katrina Schulist | M | 417-341-8058 |
59f3a54c-5bde-4790-952a-3be66bc3f94e | 7668 Schowalter Streets | 1995-10-10 07:00:00.000 | Chadrick_Beier@example.net | Elbert Kris | M | 647-843-2866 |
6f623a35-f4b9-4a51-ad50-91c3206e41f6 | 3805 Estevan Radial | 1995-10-10 07:00:00.000 | Daniela.Quitzon20@example.com | June Rowe | M | 841-709-3392 |
9f686134-e286-47b3-bb61-60bb56f5bad2 | 76029 Lloyd Points | 1995-10-10 07:00:00.000 | Reid_McDermott15@example.com | Sammy King | M | 908-596-0825 |
- Then we will use the PostgresSQL below to create documents which contains combinations values of text columns as
address
,email
,full_nanme
andphone
.
1 2 3 |
|
- After execute the query we have the result.
document |
---|
893 Bogisich Row Maximilian.Renner68@example.net Abel Rath IV 377-802-6499 |
57786 Selmer Squares Henri.Schowalter8@example.net Katrina Schulist 417-341-8058 |
7668 Schowalter Streets Chadrick_Beier@example.net Elbert Kris 647-843-2866 |
3805 Estevan Radial Daniela.Quitzon20@example.com June Rowe 841-709-3392 |
76029 Lloyd Points Reid_McDermott15@example.com Sammy King 908-596-0825 |
-
Now with these documents we can use them for creating
tsvector
.
What Is The tsvector
Datatype?#
PostgreSQL
provides two data types that are designed to support full text search, which is the activity of searching through a collection of natural-language documents to locate those that best match a query. Thetsvector
type represents a document in a form optimized for text search; thetsquery
type similarly represents a text query.- A
tsvector
value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details). Sorting and duplicate-elimination are done automatically during input, as shown in this example:
1 |
|
tsvector |
---|
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat' |
-
It is important to understand that the
tsvector
type itself does not perform any word normalization; it assumes the words it is given are normalized appropriately for the application. -
For most English-text-searching applications the above words would be considered non-normalized, but
tsvector
doesn't care. Raw document text should usually be passed throughto_tsvector
to normalize the words appropriately for searching.
What Is The to_tsvector
function?#
- To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the query. It's also important to be able to display the results nicely. PostgreSQL provides support for all of these functions.
PostgreSQL
provides the functionto_tsvector
for converting adocument
to thetsvector
data type.
1 |
|
to_tsvector
parses a textual document into tokens, reduces the tokens to lexemes, and returns atsvector
which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example:
1 |
|
to_tsvector |
---|
'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4 |
-
In the example above we see that the resulting
tsvector
does not contain the wordsa
,on
andand
the wordrats
becamerat
and the punctuation sign-
was ignored. -
The
to_tsvector
function internally calls a parser which breaks thedocument
text intotokens
and assigns a type to each token. For eachtoken
, a list ofdictionaries
is consulted, where the list can vary depending on thetoken
type. The first dictionary that recognizes thetoken
emits one or more normalized lexemes to represent thetoken
. For example,rats
becamerat
because one of the dictionaries recognized that the wordrats
is a plural form ofrat
. Some words are recognized as stop words, which causes them to be ignored since they occur too frequently to be useful in searching. In our example these area
,on
, andand
. If no dictionary in the list recognizes the token then it is also ignored. In this example that happened to the punctuation sign-
because there are in fact no dictionaries assigned for its token type (Space symbols), meaning space tokens will never be indexed. The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration. It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configurationenglish
for theEnglish
language.
What Is The tsquery
Datatype?#
- A
tsquery
value stores lexemes that are to be searched for, and can combine them using the Boolean operators&
(AND),|
(OR), and!
(NOT), as well as the phrase search operator<->
(FOLLOWED BY). There is also a variant<N>
of the FOLLOWED BY operator, whereN
is an integer constant that specifies the distance between the two lexemes being searched for.<->
is equivalent to<1>
. - Parentheses can be used to enforce grouping of these operators. In the absence of parentheses,
!
(NOT) binds most tightly,<->
(FOLLOWED BY) next most tightly, then&
(AND), with|
(OR) binding the least tightly.
1 |
|
tsquery |
---|
'fat' & 'rat' |
1 |
|
1 2 3 |
|
1 |
|
tsquery |
---|
'fat' & 'rat' & !'cat' |
- Searching for phrases is possible with the help of the
<->
(FOLLOWED BY)tsquery
operator, which matches only if its arguments have matches that are adjacent and in the given order. For example:
1 |
|
tsquery |
---|
'fatal' <-> 'error' |
- Also, lexemes in a
tsquery
can be labeled with*
to specify prefix matching:
1 |
|
tsquery |
---|
'super':* |
What Are Supported Functions For tsquery
?#
-
In PostgreSQL,
to_tsquery
,plainto_tsquery
, andphraseto_tsquery
are functions related to full-text search. These functions are used to construct query strings in the tsquery format, which is then used in conjunction with the@@
operator to perform full-text searches using thetsvector
data type. -
to_tsquery
Function:- Syntax:
to_tsquery(config_name, query_text)
- Description: Converts a query string into a
tsquery
data type, which is used for full-text searching. Theconfig_name
parameter specifies the text search configuration to use, andquery_text
is the actual query string.
- Syntax:
-
Example:
1 |
|
to_tsquery |
---|
'fat' & 'rat' |
-
plainto_tsquery
Function:- Syntax:
plainto_tsquery(config_name, query_text)
- Description: Converts a plain text query string into a
tsquery
. It is similar toto_tsquery
, but it processes the query text in a simpler way.
- Syntax:
-
Example:
1 |
|
plainto_tsquery |
---|
'cat' & 'ate' & 'rat' |
-
phraseto_tsquery
Function:- Syntax:
phraseto_tsquery(config_name, query_text)
- Description: Similar to
plainto_tsquery
, but it processes the query text as a phrase, considering the order of the words in the input.
- Syntax:
-
Example:
1 |
|
to_tsquery |
---|
'fatal' <-> 'error' |
- There is a more general version of the FOLLOWED BY operator having the form
<N>
, whereN
is an integer standing for the difference between the positions of the matching lexemes.<1>
is the same as<->
, while<2>
allows exactly one other lexeme to appear between the matches, and so on. Thephraseto_tsquery
function makes use of this operator to construct atsquery
that can match a multi-word phrase when some of the words are stop words. For example:
1 |
|
phraseto_tsquery |
---|
'cat' <-> 'ate' <2> 'rat' |
Basic Text Matching#
Full text searching
in PostgreSQL is based on the match operator@@
, which returnstrue
if atsvector
(document) matches atsquery
(query). It doesn't matter which data type is written first.
1 |
|
?column? |
---|
true |
- As the above example suggests, a
tsquery
is not just raw text, any more than atsvector
is. Atsquery
contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. There are functionsto_tsquery
,plainto_tsquery
, andphraseto_tsquery
that are helpful in converting user-written text into a propertsquery
, primarily by normalizing words appearing in the text. Similarly,to_tsvector
is used to parse and normalize a document string. So in practice a text search match would look more like this:
1 |
|
?column? |
---|
true |
- The
@@
operator also supportstext
input, allowing explicit conversion of a text string totsvector
ortsquery
to be skipped in simple cases. The variants available are:
1 2 3 4 |
|
-
The first two of these we saw already. The form
text
@@
tsquery
is equivalent toto_tsvector(x) @@ y
. The formtext
@@
text
is equivalent toto_tsvector(x) @@ plainto_tsquery(y)
. -
Within a
tsquery
, the&
(AND) operator specifies that both its arguments must appear in the document to have a match. Similarly, the|
(OR) operator specifies that at least one of its arguments must appear, while the!
(NOT) operator specifies that its argument must not appear in order to have a match. For example, the queryfat & ! rat
matches documents that containfat
but notrat
.
Example With JPA#
- Now, let's take an example for applying PosgresSQL full text search with JPA.
Dependency#
- Let's create an sample project and add these dependencies below.
pom.xml | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 |
|
- Then let's apply the plugin below for OpenApi generator and Mapstruct
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 |
|
OpenApi#
- Now, in the folder
resource
, we will create a folderopenapi
and aopenapi-server.yaml
inside it. The content of this file will look like below with some apis.
openapi-server.yml | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 |
|
- Now, we can use command
mvn clean install
to build and generate apis and model fromopenapi-server.yaml
.
Entity#
- Then let's define a
CustomerEnitty
as below.
CustomerEntity | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
|
Gender.java | |
---|---|
1 2 3 4 5 6 7 |
|
Repository#
- Now, lets create the Repository for the CustomerEntity above.
CustomerRepository.java | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
|
- As you can see in this Repository we will define a method to search customers using native query with some params such as
keyword
,pageSize
andpageNumber
.
Dto#
- Next, let's create DTOs for CustomerEntity and
CustomerFilterResult
as below.
Customer.java | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
|
CustomerFilterResult.java | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
Mapper#
- Next, let's create a mapper interface for mapping Customer DTO, Entity and CustomerFilterResult models using MapStruct.
AutoCustomerMapper.java | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
|
Service#
- Now, with these classes that we defined above, we can create service class for handling logics for CRUD entities.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 |
|
Controller#
- Now, we can create some basic controllers and implement the generated apis.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 |
|
Testing#
- Now, let's start the application then use the postman and with the request body as below.
1 2 3 4 5 6 7 8 |
|
- Postman supports us to random some common fields every time we send the request body.
- Now, let's run the api CreateCustomer for 10000 times. Then we will have 10000 records in the
customers
table.
- Next, let's continue use postman for searching customer api as below.
- As you can see the search query will look up the keyword
Knolls
orKnoll
in columnsfull_name
,email
,address
,dob
,phone
andgender
. If any column contains this keyword then it will be return. Currently we have 10 records in 9994 records contains this keyword.