Skip to content

Commit 34bfe5e

Browse files
tco-190 Filter searches before citation detector
** Why are these changes being introduced: There is a transaction cost to calling the citation detector, both in terms of time and processing time. If we can develop a way to flag searches which have zero chance of being a citation, we can cut out this expense by skipping the citation detector. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-190 ** How does this address that need: This moves the extract_features method in Detector::MlCitation into the initialize method, allowing us to quickly check for whether a given term has enough non-zero features to make it worth calling the detector. From our analysis in the TACOS notebooks, we believe that phrases which result in only two non-zero values among all their features will never end up being a citation - and that this threshold will allow us to skip the citation detector in 90% of searches. The filtering is performed in a convenience method named enough_nonzero_values? (naming things is hard). ** Document any side effects to this change: The @Detections instance variable is defined as false at the top of the initialize method, before the first guard clause, so that we get a consistent Boolean value in all conditions. This required one test to change that previously expected a nil.
1 parent 1b0a034 commit 34bfe5e

File tree

5 files changed

+46
-8
lines changed

5 files changed

+46
-8
lines changed

Gemfile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -128,6 +128,7 @@ group :test do
128128
# Use system testing [https://guides.rubyonrails.org/testing.html#system-testing]
129129
gem 'capybara'
130130
gem 'climate_control'
131+
gem 'mocha'
131132
gem 'selenium-webdriver'
132133
gem 'simplecov'
133134
gem 'simplecov-lcov'

Gemfile.lock

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -237,6 +237,8 @@ GEM
237237
matrix (0.4.2)
238238
mini_mime (1.1.5)
239239
minitest (5.25.5)
240+
mocha (2.7.1)
241+
ruby2_keywords (>= 0.0.5)
240242
msgpack (1.8.0)
241243
multi_json (1.15.0)
242244
net-http (0.6.0)
@@ -398,6 +400,7 @@ GEM
398400
rubocop (>= 1.75.0, < 2.0)
399401
rubocop-ast (>= 1.44.0, < 2.0)
400402
ruby-progressbar (1.13.0)
403+
ruby2_keywords (0.0.5)
401404
rubyzip (2.4.1)
402405
sassc (2.4.0)
403406
ffi (~> 1.9)
@@ -518,6 +521,7 @@ DEPENDENCIES
518521
importmap-rails
519522
jbuilder
520523
mitlibraries-theme!
524+
mocha
521525
omniauth
522526
omniauth-rails_csrf_protection
523527
omniauth_openid_connect

app/models/detector/ml_citation.rb

Lines changed: 18 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -7,11 +7,15 @@ class MlCitation
77
# For now the initialize method just needs to consult the external lambda.
88
#
99
# @param phrase String. Often a `Term.phrase`.
10-
# @return Nothing intentional. Data is written to Hash `@detections` during processing.
10+
# @return Nothing intentional. Data is written to Boolean `@detections` during processing.
1111
def initialize(phrase)
12+
@detections = false
1213
return unless self.class.expected_env?
1314

14-
response = fetch(phrase)
15+
features = extract_features(phrase)
16+
return unless enough_nonzero_values?(features)
17+
18+
response = fetch(features)
1519
@detections = response unless response == 'Error'
1620
end
1721

@@ -111,10 +115,10 @@ def define_lambda
111115
# define_payload defines the Hash that will be sent to the lambda.
112116
#
113117
# @return Hash
114-
def define_payload(phrase)
118+
def define_payload(features)
115119
{
116120
action: 'predict',
117-
features: extract_features(phrase),
121+
features: features,
118122
challenge_secret: self.class.lambda_secret
119123
}
120124
end
@@ -135,9 +139,9 @@ def extract_features(phrase)
135139
# error handling with the response.
136140
#
137141
# @return Boolean or 'Error'
138-
def fetch(phrase)
142+
def fetch(features)
139143
lambda = define_lambda
140-
payload = define_payload(phrase)
144+
payload = define_payload(features)
141145

142146
response = lambda.post(self.class.lambda_path, payload.to_json)
143147

@@ -151,5 +155,13 @@ def fetch(phrase)
151155
'Error'
152156
end
153157
end
158+
159+
# Enough_nonzero_values? checks that a provided hash contains at least three values which are not zero.
160+
#
161+
# @param hash Hash
162+
# @return Integer
163+
def enough_nonzero_values?(hash)
164+
hash.values.count { |v| v != 0 } >= 3
165+
end
154166
end
155167
end

test/models/detector/ml_citation_test.rb

Lines changed: 21 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -77,12 +77,32 @@ class MlCitationTest < ActiveSupport::TestCase
7777

7878
assert_instance_of Detector::MlCitation, prediction
7979

80-
assert_nil(prediction.detections)
80+
assert_equal(false, prediction.detections)
8181
end
8282
end
8383
end
8484
end
8585

86+
test 'lookup skips fetching a prediction for search phrases with less than three features' do
87+
Detector::MlCitation.any_instance.expects(:fetch).never
88+
89+
with_enabled_mlcitation do
90+
# This search phrase is expected to have only two non-zero feature values, which based on
91+
# our analyses will never result in a predicted citation.
92+
Detector::MlCitation.new('foobar (2025)')
93+
end
94+
end
95+
96+
test 'lookup does not skip fetching a prediction for search phrases with three or more features' do
97+
Detector::MlCitation.any_instance.expects(:fetch).once
98+
99+
with_enabled_mlcitation do
100+
# This search phrase is expected to have three non-zero feature values, which is the minimum
101+
# number we expect to have any hope of a citation.
102+
Detector::MlCitation.new('foobar (2025) 1234-76')
103+
end
104+
end
105+
86106
# Record method
87107
test 'record does relevant work' do
88108
with_enabled_mlcitation do

test/test_helper.rb

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
ENV['RAILS_ENV'] ||= 'test'
1414
require_relative '../config/environment'
1515
require 'rails/test_help'
16+
require 'mocha/minitest'
1617

1718
VCR.configure do |config|
1819
config.ignore_localhost = false
@@ -124,4 +125,4 @@ def with_enabled_mlcitation
124125
}
125126
ensure
126127
ENV.replace(old_env)
127-
end
128+
end

0 commit comments

Comments
 (0)